Introduction

In this project, we attempt to accurately predict whether a patient will have a stroke based on input variables such as age, gender, BMI, hypertension, glucose level, etc. We were originally attracted to this data set because it involves discovering practical insights that might ultimately help save someone’s life. A stroke is a scary event to go through, so by creating predictive models, we hope to gain better insight not only into which variables have the strongest effect on a patient’s probability of having a stroke, but also into which variables are most important to target when trying to reduce stroke risk. While somewhat rare, the effects of having a stroke can be life-changing. We believe there are two main business applications of this project. The first is the most obvious: saving lives. Whether it be a family medicine doctor or an emergency room responder, knowing how certain variables interact with the chance of stroke can lead to better prescriptions, better service, and more lives saved. The second business application is less obvious, but still quite relevant: insurance agencies. An insurance agency can take a patient’s information, input it into the models created, and see the client’s estimated chance of stroke. This can inform their modeling of medical insurance packages as well as the placement of the client within the packages’ different levels of coverage and risk premiums.

Next, we will load and clean the data so it is prepared for analysis. From there, we will create different graphs and visuals to see whether certain input variables interact with each other or have a stronger relationship with the response variable (stroke or not). If any strong correlations are discovered in the data visualization section, we will take note and fit our general models on different train and test splits. After data visualization, we will create the general models: SVM, Decision Tree, KNN, Logistic Regression, and ANN. Within each of these model chunks, we will create a final model that uses all the variables on a normalized train and test split, along with a confusion matrix that compares the model’s predictions to the test data. From there, we will also run models and confusion matrices on different categorical splits: for example, married versus non-married, hypertension versus non-hypertension, or whatever is discovered in our initial data visualization. Ultimately, we will end with a stacked model that combines the predictions from our main general models by assigning weights to the individual models’ predictions. Once the stacked model is made, we will improve it using the train() command learned in class, paired with a confusion matrix with which we can compare performance statistics between the individual models and the stacked model. We will end with a summary of our findings.
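The stacking idea described above can be sketched in a few lines. This is a toy illustration only: the 0/1 predictions and the weights are made up for the example, not the values the later train()-tuned model will produce.

```r
# Hypothetical 0/1 predictions from three base models on four test patients
svm_pred <- c(0, 1, 0, 1)
dt_pred  <- c(0, 1, 1, 1)
knn_pred <- c(0, 0, 0, 1)

# Illustrative weights (assumed, summing to 1); the real weights would be learned
weights <- c(svm = 0.5, dt = 0.3, knn = 0.2)

# Weighted vote: combine the base predictions, then threshold at 0.5
score <- weights["svm"] * svm_pred + weights["dt"] * dt_pred + weights["knn"] * knn_pred
stacked_pred <- ifelse(score > 0.5, 1, 0)
stacked_pred  # 0 1 0 1
```

Here the second patient is flagged even though KNN disagreed, because the two higher-weighted models predicted a stroke.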

Data Preparation

Library Loading

library(C50)
library(class)
library(gmodels)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(aod)
library(VGAM)
## Loading required package: stats4
## Loading required package: splines
## 
## Attaching package: 'VGAM'
## The following object is masked from 'package:lmtest':
## 
##     lrtest
## The following object is masked from 'package:caret':
## 
##     predictors
library(neuralnet)
library(kernlab)
## 
## Attaching package: 'kernlab'
## The following object is masked from 'package:VGAM':
## 
##     nvar
## The following object is masked from 'package:ggplot2':
## 
##     alpha
library(lsr)

Load in Data

stroke <- read.csv("healthcare-dataset-stroke-data.csv")
stroke <- stroke[stroke$gender != "Other",]
stroke$id <- NULL
stroke$gender <- as.factor(stroke$gender)
stroke$work_type <- as.factor(stroke$work_type)
stroke$smoking_status <- as.factor(stroke$smoking_status)
stroke$ever_married <- as.factor(stroke$ever_married)
stroke$Residence_type <- as.factor(stroke$Residence_type)
summary(stroke)
##     gender          age         hypertension     heart_disease     ever_married
##  Female:2994   Min.   : 0.08   Min.   :0.00000   Min.   :0.00000   No :1756    
##  Male  :2115   1st Qu.:25.00   1st Qu.:0.00000   1st Qu.:0.00000   Yes:3353    
##                Median :45.00   Median :0.00000   Median :0.00000               
##                Mean   :43.23   Mean   :0.09748   Mean   :0.05402               
##                3rd Qu.:61.00   3rd Qu.:0.00000   3rd Qu.:0.00000               
##                Max.   :82.00   Max.   :1.00000   Max.   :1.00000               
##          work_type    Residence_type avg_glucose_level     bmi           
##  children     : 687   Rural:2513     Min.   : 55.12    Length:5109       
##  Govt_job     : 657   Urban:2596     1st Qu.: 77.24    Class :character  
##  Never_worked :  22                  Median : 91.88    Mode  :character  
##  Private      :2924                  Mean   :106.14                      
##  Self-employed: 819                  3rd Qu.:114.09                      
##                                      Max.   :271.74                      
##          smoking_status     stroke       
##  formerly smoked: 884   Min.   :0.00000  
##  never smoked   :1892   1st Qu.:0.00000  
##  smokes         : 789   Median :0.00000  
##  Unknown        :1544   Mean   :0.04874  
##                         3rd Qu.:0.00000  
##                         Max.   :1.00000

We removed the one observation whose gender was neither female nor male, as a single observation would not lead to a statistically significant result. We also removed the ID column, as it is of no use in predicting strokes.

Create Linear Model to Predict Missing BMI

# create the bmi data that doesn't have NA
bmi_data <- stroke[stroke$bmi != "N/A",]
# can't use stroke to predict bmi
bmi_model <- lm(bmi ~ .-stroke, data = bmi_data)
summary(bmi_model)
## 
## Call:
## lm(formula = bmi ~ . - stroke, data = bmi_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -20.433  -4.530  -1.123   3.301  67.257 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                18.954575   0.492676  38.473  < 2e-16 ***
## genderMale                  0.093146   0.201706   0.462 0.644251    
## age                        -0.017011   0.007352  -2.314 0.020717 *  
## hypertension                2.498327   0.357115   6.996 2.99e-12 ***
## heart_disease              -0.891326   0.475227  -1.876 0.060773 .  
## ever_marriedYes             2.139592   0.288996   7.404 1.55e-13 ***
## work_typeGovt_job           8.615362   0.499707  17.241  < 2e-16 ***
## work_typeNever_worked       5.441532   1.495341   3.639 0.000277 ***
## work_typePrivate            8.625681   0.414045  20.833  < 2e-16 ***
## work_typeSelf-employed      8.146916   0.511614  15.924  < 2e-16 ***
## Residence_typeUrban        -0.003690   0.196079  -0.019 0.984985    
## avg_glucose_level           0.020095   0.002303   8.724  < 2e-16 ***
## smoking_statusnever smoked -0.418286   0.290396  -1.440 0.149819    
## smoking_statussmokes       -0.181496   0.350462  -0.518 0.604568    
## smoking_statusUnknown      -0.779020   0.330191  -2.359 0.018348 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.859 on 4893 degrees of freedom
## Multiple R-squared:  0.2395, Adjusted R-squared:  0.2373 
## F-statistic: 110.1 on 14 and 4893 DF,  p-value: < 2.2e-16
# remove insignificant variables
bmi_model <- lm(bmi ~ .-stroke-gender-Residence_type+age*hypertension*ever_married-heart_disease, data = bmi_data)
summary(bmi_model)
## 
## Call:
## lm(formula = bmi ~ . - stroke - gender - Residence_type + age * 
##     hypertension * ever_married - heart_disease, data = bmi_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -19.202  -4.403  -1.055   3.154  55.163 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       18.063859   0.469391  38.484  < 2e-16 ***
## age                                0.094458   0.013197   7.158 9.43e-13 ***
## hypertension                      30.586418   3.201200   9.555  < 2e-16 ***
## ever_marriedYes                    7.097632   0.645289  10.999  < 2e-16 ***
## work_typeGovt_job                  5.690280   0.557308  10.210  < 2e-16 ***
## work_typeNever_worked              4.478524   1.467994   3.051  0.00229 ** 
## work_typePrivate                   5.929696   0.476255  12.451  < 2e-16 ***
## work_typeSelf-employed             5.507016   0.559351   9.845  < 2e-16 ***
## avg_glucose_level                  0.021340   0.002249   9.487  < 2e-16 ***
## smoking_statusnever smoked        -0.496535   0.283977  -1.749  0.08044 .  
## smoking_statussmokes              -0.531685   0.344082  -1.545  0.12236    
## smoking_statusUnknown             -0.712403   0.323559  -2.202  0.02773 *  
## age:hypertension                  -0.478083   0.053620  -8.916  < 2e-16 ***
## age:ever_marriedYes               -0.137203   0.015470  -8.869  < 2e-16 ***
## hypertension:ever_marriedYes     -17.159100   3.625768  -4.733 2.28e-06 ***
## age:hypertension:ever_marriedYes   0.299800   0.059967   4.999 5.95e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.722 on 4892 degrees of freedom
## Multiple R-squared:  0.2699, Adjusted R-squared:  0.2676 
## F-statistic: 120.5 on 15 and 4892 DF,  p-value: < 2.2e-16
stroke$bmi[stroke$bmi == "N/A"] = predict(bmi_model, stroke[stroke$bmi == "N/A",])
stroke$bmi <- as.numeric(stroke$bmi)

Many values in the BMI column were recorded as N/A, so to address this issue, we made a linear model to predict BMI from the other variables, excluding stroke. The R-squared value was less than 0.27, so the model does not actually do a great job of explaining the variance in BMI. However, to keep more observations usable in the later predictions, we decided to impute the missing values with the model’s BMI predictions.

Define Helper Functions

set.seed(42)

normalize <- function(x) {
  if(max(x) - min(x) == 0){
    return (x)
  }
  return ((x - min(x)) / (max(x) - min(x)))
}

set_splits <- function(x, norm) {
  if(norm){
    x = as.data.frame(lapply(as.data.frame(model.matrix(~.-1,x)), normalize))
  }
  testrows <- sample(1:nrow(x), 0.25*nrow(x))
  return (list(x[-testrows,], x[testrows,]))
}

We made some helper functions to min-max normalize the data and to create 75/25 train/test splits.
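A quick sanity check of the normalize() helper (repeated here so the snippet runs on its own; the definition is identical to the one above):

```r
# Min-max normalization: rescales a numeric vector to [0, 1]
normalize <- function(x) {
  if (max(x) - min(x) == 0) return(x)  # constant column: avoid dividing by zero
  (x - min(x)) / (max(x) - min(x))
}

normalize(c(2, 4, 6, 10))  # -> 0.00 0.25 0.50 1.00
normalize(c(3, 3))         # constant input is returned unchanged: 3 3
```

The constant-column guard matters because model.matrix() can produce all-zero dummy columns on a subset of the data (e.g. work_typeNever_worked among older patients), which would otherwise become NaN.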

Visualizations

Initial Visualizations

par(mfrow=c(2,2))
barplot(prop.table(table(stroke$stroke, stroke$gender), 2), main="Gender vs Stroke", xlab="Gender", col=c("darkblue","red"))
barplot(prop.table(table(stroke$stroke, stroke$hypertension), 2), main="Hypertension vs Stroke", xlab="Hypertension", col=c("darkblue","red"))
barplot(prop.table(table(stroke$stroke, stroke$heart_disease), 2), main="Heart Disease vs Stroke", xlab="Heart Disease", col=c("darkblue","red"))
barplot(prop.table(table(stroke$stroke, stroke$ever_married), 2), main="Marriage vs Stroke", xlab="Marriage", col=c("darkblue","red"))

par(mfrow=c(2,2))
barplot(prop.table(table(stroke$stroke, stroke$work_type), 2), main="Work vs Stroke", xlab="Work", col=c("darkblue","red"))
barplot(prop.table(table(stroke$stroke, stroke$Residence_type), 2), main="Residence vs Stroke", xlab="Residence Type", col=c("darkblue","red"))
barplot(prop.table(table(stroke$stroke, stroke$smoking_status), 2), main="Smoker vs Stroke", xlab="Smoking Status", col=c("darkblue","red"))
barplot(prop.table(table(stroke$stroke, quantileCut(stroke$age, 10)), 2), main="Age vs Stroke", xlab="Age", col=c("darkblue","red"))

par(mfrow=c(2,2))
barplot(prop.table(table(stroke$stroke, quantileCut(stroke$bmi, 10)), 2), main="BMI vs Stroke", xlab="BMI", col=c("darkblue","red"))
barplot(prop.table(table(stroke$stroke, quantileCut(stroke$avg_glucose_level, 10)), 2), main="Glucose vs Stroke", xlab="Glucose Level", col=c("darkblue","red"))

First, we wanted to visualize how the data relate to the variable of interest: stroke or no stroke. Looking at the plots created, there is a higher proportion of strokes among those with heart disease, hypertension, and marriage. However, all of these attributes are related to age, which itself is associated with a higher proportion of strokes as it increases. Residence type does not change the proportion of strokes, and smoking and work status do not show strong correlations. Glucose level only has a large impact at the very extreme ends, but oddly enough, mid-range BMI levels led to higher proportions of stroke.

Inter-Variable Visualization

prop.table(table(stroke$ever_married, stroke$hypertension), 1)
##      
##                0          1
##   No  0.96981777 0.03018223
##   Yes 0.86728303 0.13271697
prop.table(table(stroke$ever_married, stroke$heart_disease), 1)
##      
##                0          1
##   No  0.98177677 0.01822323
##   Yes 0.92722935 0.07277065
prop.table(table(stroke$hypertension, stroke$heart_disease), 1)
##    
##              0          1
##   0 0.95402299 0.04597701
##   1 0.87148594 0.12851406
par(mfrow=c(2,2))
barplot(prop.table(table(stroke$ever_married, quantileCut(stroke$age, 10)), 2), main="Age vs Marriage", xlab="Age", col=c("darkblue","red"))
barplot(prop.table(table(stroke$hypertension, quantileCut(stroke$age, 10)), 2), main="Age vs Hypertension", xlab="Age", col=c("darkblue","red"))
barplot(prop.table(table(stroke$heart_disease, quantileCut(stroke$age, 10)), 2), main="Age vs Heart Disease", xlab="Age", col=c("darkblue","red"))
par(mfrow=c(2,2))

barplot(prop.table(table(stroke$heart_disease[stroke$ever_married=="Yes"], cut(stroke$age[stroke$ever_married=="Yes"], breaks = c(0,10,20,30,40,50,60,70,80,90))), 2), main="Age vs Heart Disease for Married", xlab="Age", col=c("darkblue","red"))
barplot(prop.table(table(stroke$heart_disease[stroke$ever_married=="No"], cut(stroke$age[stroke$ever_married=="No"], breaks = c(0,10,20,30,40,50,60,70,80,90))), 2), main="Age vs Heart Disease for Unmarried", xlab="Age", col=c("darkblue","red"))
barplot(prop.table(table(stroke$hypertension[stroke$ever_married=="Yes"], cut(stroke$age[stroke$ever_married=="Yes"], breaks = c(0,10,20,30,40,50,60,70,80,90))), 2), main="Age vs Hypertension for Married", xlab="Age", col=c("darkblue","red"))
barplot(prop.table(table(stroke$hypertension[stroke$ever_married=="No"], cut(stroke$age[stroke$ever_married=="No"], breaks = c(0,10,20,30,40,50,60,70,80,90))), 2), main="Age vs Hypertension for Unmarried", xlab="Age", col=c("darkblue","red"))

First, looking at how marriage, hypertension, and heart disease relate to one another, the effects seem negligible; all of these factors correlate more strongly with age. As shown in the first set of graphs, increasing age leads to much higher chances of marriage, heart disease, and hypertension. Marriage was one factor suggested to correlate with hypertension and heart disease, but once the effect of age is factored in, the second set of graphs shows that being married versus unmarried did not significantly change the proportions of heart disease or hypertension.

Thus, we decided to split on BMI and age in our models, since those two variables seem to have a large effect on the probability of stroke. The BMI range between 26.5 and 32 showed a higher proportion of strokes, and being older than 55 also led to much higher chances, so having separate models will hopefully improve accuracy overall.

Create Splits

general_norm <- set_splits(stroke, norm = TRUE)
age_young_data_norm = set_splits(stroke[stroke$age<=55,], norm = TRUE)
age_old_data_norm = set_splits(stroke[stroke$age>55,], norm = TRUE)
bmi_risk_data_norm = set_splits(stroke[stroke$bmi>26.5 & stroke$bmi<32,], norm = TRUE)
bmi_safe_data_norm = set_splits(stroke[stroke$bmi<=26.5 | stroke$bmi>=32,], norm = TRUE)

SVM Model

We wanted to run an SVM model to see if it can accurately classify “stroke” or “no stroke” based on the given variables. SVM models are known to be quite accurate in classification scenarios, and because our data set is fairly clean and somewhat small, we believe an SVM model can predict “stroke” or “no stroke” accurately. It is also quite efficient, given that it uses a subset of the training points (the support vectors) rather than all of them.

General Model

stroke_svm <- ksvm(stroke ~ ., data = general_norm[[1]], kernel = "rbfdot")
svm_pred <- predict(stroke_svm, general_norm[[2]])
cutoff = quantile(svm_pred, 1-mean(general_norm[[1]]$stroke))
svm_pred = ifelse(svm_pred>cutoff, 1, 0)
confusionMatrix(as.factor(svm_pred), as.factor(general_norm[[2]]$stroke))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1159   57
##          1   52    9
##                                          
##                Accuracy : 0.9146         
##                  95% CI : (0.898, 0.9294)
##     No Information Rate : 0.9483         
##     P-Value [Acc > NIR] : 1.0000         
##                                          
##                   Kappa : 0.0969         
##                                          
##  Mcnemar's Test P-Value : 0.7016         
##                                          
##             Sensitivity : 0.9571         
##             Specificity : 0.1364         
##          Pos Pred Value : 0.9531         
##          Neg Pred Value : 0.1475         
##              Prevalence : 0.9483         
##          Detection Rate : 0.9076         
##    Detection Prevalence : 0.9522         
##       Balanced Accuracy : 0.5467         
##                                          
##        'Positive' Class : 0              
## 

The SVM model used an RBF kernel, with the classification cutoff set according to the proportion of patients in the training data who suffered a stroke. This led to a kappa statistic of 0.0969, which is very low and indicative of a poor model. Next, we want to see whether splitting the data based on age and BMI increases the model’s accuracy or kappa statistic.
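The cutoff rule above can be seen in isolation on toy data. With roughly 5% stroke prevalence in the training split, we label the top ~5% of the SVM’s continuous decision values as “stroke”; the simulated scores below stand in for the real predict() output.

```r
set.seed(1)
decision_values <- rnorm(1000)   # hypothetical continuous SVM decision values
train_prevalence <- 0.05         # assumed share of stroke cases in the training split

# Same rule as in the model chunk: threshold at the (1 - prevalence) quantile
cutoff <- quantile(decision_values, 1 - train_prevalence)
pred <- ifelse(decision_values > cutoff, 1, 0)
mean(pred)  # fraction predicted "stroke" matches the assumed prevalence (0.05)
```

The design choice here is to force the predicted positive rate to match the training prevalence, rather than using a fixed threshold such as 0.5, which would predict almost no strokes on such an imbalanced outcome.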

SVM Age Based Model

# first is young model
young_svm <- ksvm(stroke ~ ., data = age_young_data_norm[[1]], kernel = "rbfdot")
young_pred <- predict(young_svm, age_young_data_norm[[2]])
cutoff = quantile(young_pred, 1-mean(age_young_data_norm[[1]]$stroke))
young_pred = ifelse(young_pred>cutoff, 1, 0)
# now old model
old_svm <- ksvm(stroke ~ ., data = age_old_data_norm[[1]], kernel = "rbfdot")
## Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.
old_pred <- predict(old_svm, age_old_data_norm[[2]])
cutoff = quantile(old_pred, 1-mean(age_old_data_norm[[1]]$stroke))
old_pred = ifelse(old_pred>cutoff, 1, 0)

# now combine them together
combined_svmage_pred = c(young_pred, old_pred)
# get the true labels
true_label = c(age_young_data_norm[[2]]$stroke, age_old_data_norm[[2]]$stroke)
confusionMatrix(as.factor(combined_svmage_pred), as.factor(true_label))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1158   58
##          1   50   11
##                                           
##                Accuracy : 0.9154          
##                  95% CI : (0.8988, 0.9301)
##     No Information Rate : 0.946           
##     P-Value [Acc > NIR] : 1.0000          
##                                           
##                   Kappa : 0.1249          
##                                           
##  Mcnemar's Test P-Value : 0.5006          
##                                           
##             Sensitivity : 0.9586          
##             Specificity : 0.1594          
##          Pos Pred Value : 0.9523          
##          Neg Pred Value : 0.1803          
##              Prevalence : 0.9460          
##          Detection Rate : 0.9068          
##    Detection Prevalence : 0.9522          
##       Balanced Accuracy : 0.5590          
##                                           
##        'Positive' Class : 0               
## 

When we create two models split on age and combine their results, we can see that the kappa statistic increases to 0.1249, noticeably higher than that of the general model.

SVM BMI Based Model

risk_svm <- ksvm(stroke ~ ., data = bmi_risk_data_norm[[1]], kernel = "rbfdot")
risk_pred <- predict(risk_svm, bmi_risk_data_norm[[2]])
cutoff = quantile(risk_pred, 1-mean(bmi_risk_data_norm[[1]]$stroke))
risk_pred = ifelse(risk_pred>cutoff, 1, 0)

safe_svm <- ksvm(stroke ~ ., data = bmi_safe_data_norm[[1]], kernel = "rbfdot")
safe_pred <- predict(safe_svm, bmi_safe_data_norm[[2]])
cutoff = quantile(safe_pred, 1-mean(bmi_safe_data_norm[[1]]$stroke))
safe_pred = ifelse(safe_pred>cutoff, 1, 0)

combined_svmbmipred = c(risk_pred, safe_pred)
true_label = c(bmi_risk_data_norm[[2]]$stroke, bmi_safe_data_norm[[2]]$stroke)
confusionMatrix(as.factor(combined_svmbmipred), as.factor(true_label))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1165   50
##          1   47   15
##                                          
##                Accuracy : 0.924          
##                  95% CI : (0.9081, 0.938)
##     No Information Rate : 0.9491         
##     P-Value [Acc > NIR] : 0.9999         
##                                          
##                   Kappa : 0.1963         
##                                          
##  Mcnemar's Test P-Value : 0.8391         
##                                          
##             Sensitivity : 0.9612         
##             Specificity : 0.2308         
##          Pos Pred Value : 0.9588         
##          Neg Pred Value : 0.2419         
##              Prevalence : 0.9491         
##          Detection Rate : 0.9123         
##    Detection Prevalence : 0.9514         
##       Balanced Accuracy : 0.5960         
##                                          
##        'Positive' Class : 0              
## 

The best kappa statistic achieved by the SVM came from splitting on BMI, with a resulting kappa of 0.1963. It seems that the middle BMI range has different predictors of stroke likelihood than the two tails, just as older patients differ from younger ones. Overall, though, the SVM was not a great classifier of strokes. With the use of an RBF kernel, the coefficients also become uninterpretable, so the SVM was not particularly useful for this task.

Decision Tree

Next, we want to run a Decision Tree model to see whether it makes good predictions. The benefits of a decision tree model include the managerial insight it provides into which variables have the biggest impact on whether or not a person has a stroke.

# Rebalance errors: we want to prioritize minimizing false negatives, since if
# people are at risk of stroke we want to know. We used a cost of 20 so that
# only about 20% of true positives show up as negatives. This is a number we
# decided on as a group, one we would feel comfortable with if we were the
# doctor or hospital giving advice.
error_cost <- matrix(c(0, 1, 20, 0), nrow = 2)

#build decision tree
stroke_dt <- C5.0(as.factor(stroke) ~ ., data = general_norm[[1]], costs = error_cost)
## Warning: no dimnames were given for the cost matrix; the factor levels will be
## used
plot(stroke_dt)

summary(stroke_dt)
## 
## Call:
## C5.0.formula(formula = as.factor(stroke) ~ ., data = general_norm[[1]], costs
##  = error_cost)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Tue Apr 27 16:28:01 2021
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 3832 cases (17 attributes) from undefined.data
## Read misclassification costs from undefined.costs
## 
## Decision tree:
## 
## age > 0.6459961:
## :...age > 0.8168945:
## :   :...hypertension > 0: 1 (163/123)
## :   :   hypertension <= 0:
## :   :   :...bmi <= 0.2726232: 1 (389/323)
## :   :       bmi > 0.2726232:
## :   :       :...smoking_statusnever.smoked <= 0: 0 (42)
## :   :           smoking_statusnever.smoked > 0: 1 (31/27)
## :   age <= 0.8168945:
## :   :...avg_glucose_level > 0.2373742:
## :       :...smoking_statusnever.smoked <= 0: 1 (194/162)
## :       :   smoking_statusnever.smoked > 0:
## :       :   :...avg_glucose_level <= 0.5801865: 0 (46)
## :       :       avg_glucose_level > 0.5801865: 1 (57/52)
## :       avg_glucose_level <= 0.2373742:
## :       :...smoking_statusnever.smoked > 0:
## :           :...avg_glucose_level > 0.07737051: 0 (113)
## :           :   avg_glucose_level <= 0.07737051:
## :           :   :...avg_glucose_level <= 0.0594128: 0 (36)
## :           :       avg_glucose_level > 0.0594128: 1 (12/10)
## :           smoking_statusnever.smoked <= 0:
## :           :...genderFemale <= 0:
## :               :...avg_glucose_level <= 0.06070538: 0 (29)
## :               :   avg_glucose_level > 0.06070538: 1 (107/97)
## :               genderFemale > 0:
## :               :...work_typePrivate <= 0: 0 (67)
## :                   work_typePrivate > 0:
## :                   :...bmi <= 0.2325315: 1 (64/60)
## :                       bmi > 0.2325315: 0 (41)
## age <= 0.6459961:
## :...age <= 0.4506836: 0 (1519/2)
##     age > 0.4506836:
##     :...avg_glucose_level > 0.7791525: 1 (17/15)
##         avg_glucose_level <= 0.7791525:
##         :...avg_glucose_level > 0.5360078: 0 (77)
##             avg_glucose_level <= 0.5360078:
##             :...avg_glucose_level > 0.4764565: 1 (16/14)
##                 avg_glucose_level <= 0.4764565:
##                 :...genderFemale <= 0:
##                     :...bmi <= 0.3482245: 0 (271)
##                     :   bmi > 0.3482245: 1 (28/25)
##                     genderFemale > 0:
##                     :...bmi > 0.2445829: 0 (196/1)
##                         bmi <= 0.2445829:
##                         :...avg_glucose_level > 0.2233404: 0 (88)
##                             avg_glucose_level <= 0.2233404:
##                             :...bmi > 0.1821306: 1 (110/101)
##                                 bmi <= 0.1821306:
##                                 :...bmi <= 0.1168385: 1 (13/12)
##                                     bmi > 0.1168385: 0 (106)
## 
## 
## Evaluation on training data (3832 cases):
## 
##         Decision Tree       
##    -----------------------  
##    Size      Errors   Cost  
## 
##      26 1024(26.7%)   0.28   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##    2628  1021    (a): class 0
##       3   180    (b): class 1
## 
## 
##  Attribute usage:
## 
##  100.00% age
##   44.05% avg_glucose_level
##   35.99% bmi
##   29.23% genderFemale
##   21.89% smoking_statusnever.smoked
##   16.31% hypertension
##    4.49% work_typePrivate
## 
## 
## Time: 0.0 secs
#predict the test data
stroke_dt_pred <- predict(stroke_dt, general_norm[[2]])
CrossTable(general_norm[[2]]$stroke, stroke_dt_pred)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1277 
## 
##  
##                          | stroke_dt_pred 
## general_norm[[2]]$stroke |         0 |         1 | Row Total | 
## -------------------------|-----------|-----------|-----------|
##                        0 |       874 |       337 |      1211 | 
##                          |     1.136 |     2.603 |           | 
##                          |     0.722 |     0.278 |     0.948 | 
##                          |     0.983 |     0.869 |           | 
##                          |     0.684 |     0.264 |           | 
## -------------------------|-----------|-----------|-----------|
##                        1 |        15 |        51 |        66 | 
##                          |    20.844 |    47.758 |           | 
##                          |     0.227 |     0.773 |     0.052 | 
##                          |     0.017 |     0.131 |           | 
##                          |     0.012 |     0.040 |           | 
## -------------------------|-----------|-----------|-----------|
##             Column Total |       889 |       388 |      1277 | 
##                          |     0.696 |     0.304 |           | 
## -------------------------|-----------|-----------|-----------|
## 
## 
# Confusion Matrix and Kappa Statistics


confusionMatrix(stroke_dt_pred, as.factor(general_norm[[2]]$stroke))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 874  15
##          1 337  51
##                                          
##                Accuracy : 0.7244         
##                  95% CI : (0.699, 0.7487)
##     No Information Rate : 0.9483         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.1495         
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.7217         
##             Specificity : 0.7727         
##          Pos Pred Value : 0.9831         
##          Neg Pred Value : 0.1314         
##              Prevalence : 0.9483         
##          Detection Rate : 0.6844         
##    Detection Prevalence : 0.6962         
##       Balanced Accuracy : 0.7472         
##                                          
##        'Positive' Class : 0              
## 

Our general Decision Tree model has a kappa value of 0.1495, meaning that this model by itself would predict strokes 14.95% better than random guessing. The accuracy of the model is 72.44% overall, and out of the 66 patients who did have a stroke, our model predicted 51 of them correctly (77.27%). Next, we want to see if splitting the data based on age and BMI increases the model’s accuracy or kappa statistic for our Decision Tree model.
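Incidentally, the “no dimnames were given for the cost matrix” warning above can be avoided by labeling the cost matrix explicitly. Under our reading of the C50 package’s convention (rows index the predicted class, columns the true class, matching the heavy false-negative penalty visible in the training evaluation above), the same matrix can be written as:

```r
# Same cost matrix as above, but with explicit class labels so the 20x
# penalty is unambiguously attached to false negatives: predicting "0"
# (no stroke) when the true class is "1" (stroke).
error_cost <- matrix(c(0, 1, 20, 0), nrow = 2,
                     dimnames = list(c("0", "1"),    # rows: predicted class
                                     c("0", "1")))   # cols: true class
error_cost
```

With the labels in place, C5.0() no longer has to guess which factor level each row and column refers to.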

Decision Trees Age Based Model

summary(age_old_data_norm[[1]])
##   genderFemale      genderMale          age          hypertension   
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.1923   1st Qu.:0.0000  
##  Median :1.0000   Median :0.0000   Median :0.4231   Median :0.0000  
##  Mean   :0.5723   Mean   :0.4277   Mean   :0.4728   Mean   :0.1989  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.7692   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##  heart_disease    ever_marriedYes  work_typeGovt_job work_typeNever_worked
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000    Min.   :0            
##  1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000    1st Qu.:0            
##  Median :0.0000   Median :1.0000   Median :0.0000    Median :0            
##  Mean   :0.1376   Mean   :0.9198   Mean   :0.1478    Mean   :0            
##  3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:0.0000    3rd Qu.:0            
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000    Max.   :0            
##  work_typePrivate work_typeSelf.employed Residence_typeUrban avg_glucose_level
##  Min.   :0.0000   Min.   :0.0000         Min.   :0.0000      Min.   :0.0000   
##  1st Qu.:0.0000   1st Qu.:0.0000         1st Qu.:0.0000      1st Qu.:0.1093   
##  Median :1.0000   Median :0.0000         Median :1.0000      Median :0.1892   
##  Mean   :0.5511   Mean   :0.3011         Mean   :0.5228      Mean   :0.2989   
##  3rd Qu.:1.0000   3rd Qu.:1.0000         3rd Qu.:1.0000      3rd Qu.:0.4555   
##  Max.   :1.0000   Max.   :1.0000         Max.   :1.0000      Max.   :0.9809   
##       bmi         smoking_statusnever.smoked smoking_statussmokes
##  Min.   :0.0000   Min.   :0.0000             Min.   :0.0000      
##  1st Qu.:0.3065   1st Qu.:0.0000             1st Qu.:0.0000      
##  Median :0.3710   Median :0.0000             Median :0.0000      
##  Mean   :0.3839   Mean   :0.3868             Mean   :0.1462      
##  3rd Qu.:0.4476   3rd Qu.:1.0000             3rd Qu.:0.0000      
##  Max.   :1.0000   Max.   :1.0000             Max.   :1.0000      
##  smoking_statusUnknown     stroke      
##  Min.   :0.00          Min.   :0.0000  
##  1st Qu.:0.00          1st Qu.:0.0000  
##  Median :0.00          Median :0.0000  
##  Mean   :0.18          Mean   :0.1171  
##  3rd Qu.:0.00          3rd Qu.:0.0000  
##  Max.   :1.00          Max.   :1.0000
#Young Patients
ageYoung_dt <- C5.0(as.factor(stroke) ~ ., data = age_young_data_norm[[1]], costs = error_cost)
## Warning: no dimnames were given for the cost matrix; the factor levels will be
## used
plot(ageYoung_dt)

summary(ageYoung_dt)
## 
## Call:
## C5.0.formula(formula = as.factor(stroke) ~ ., data =
##  age_young_data_norm[[1]], costs = error_cost)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Tue Apr 27 16:28:03 2021
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 2560 cases (17 attributes) from undefined.data
## Read misclassification costs from undefined.costs
## 
## Decision tree:
## 
## age <= 0.6722506: 0 (1528/2)
## age > 0.6722506:
## :...hypertension > 0:
##     :...genderFemale > 0: 1 (53/45)
##     :   genderFemale <= 0:
##     :   :...bmi <= 0.3654066: 0 (44)
##     :       bmi > 0.3654066: 1 (8/7)
##     hypertension <= 0:
##     :...avg_glucose_level <= 0.02689992: 0 (43)
##         avg_glucose_level > 0.02689992:
##         :...avg_glucose_level <= 0.04270128: 1 (30/27)
##             avg_glucose_level > 0.04270128:
##             :...avg_glucose_level <= 0.08921181: 0 (120)
##                 avg_glucose_level > 0.08921181:
##                 :...work_typeSelf.employed > 0:
##                     :...age <= 0.7086672: 1 (8/6)
##                     :   age > 0.7086672:
##                     :   :...age > 0.9635834: 1 (13/11)
##                     :       age <= 0.9635834:
##                     :       :...bmi <= 0.4009164: 0 (80)
##                     :           bmi > 0.4009164: 1 (6/5)
##                     work_typeSelf.employed <= 0:
##                     :...age <= 0.7997087: 0 (255/2)
##                         age > 0.7997087:
##                         :...smoking_statusUnknown > 0: 0 (77)
##                             smoking_statusUnknown <= 0:
##                             :...age > 0.9271668: 0 (122/1)
##                                 age <= 0.9271668:
##                                 :...avg_glucose_level <= 0.09038751: 1 (1)
##                                     avg_glucose_level > 0.09038751:
##                                     :...bmi <= 0.1935853: 0 (54)
##                                         bmi > 0.1935853: 1 (118/110)
## 
## 
## Evaluation on training data (2560 cases):
## 
##         Decision Tree       
##    -----------------------  
##    Size      Errors   Cost  
## 
##      17  216( 8.4%)   0.12   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##    2318   211    (a): class 0
##       5    26    (b): class 1
## 
## 
##  Attribute usage:
## 
##  100.00% age
##   40.31% hypertension
##   36.21% avg_glucose_level
##   28.67% work_typeSelf.employed
##   14.53% smoking_statusUnknown
##   12.11% bmi
##    4.10% genderFemale
## 
## 
## Time: 0.0 secs
#predict the test data
ageYoung_dt_pred <- predict(ageYoung_dt, age_young_data_norm[[2]])


#Old Patients
ageOld_dt <- C5.0(as.factor(stroke) ~ ., data = age_old_data_norm[[1]], costs = error_cost)
## Warning: no dimnames were given for the cost matrix; the factor levels will be
## used
plot(ageOld_dt)

summary(ageOld_dt)
## 
## Call:
## C5.0.formula(formula = as.factor(stroke) ~ ., data =
##  age_old_data_norm[[1]], costs = error_cost)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Tue Apr 27 16:28:03 2021
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 1272 cases (17 attributes) from undefined.data
## Read misclassification costs from undefined.costs
## 
## Decision tree:
## 
## age > 0.6538461: 1 (408/326)
## age <= 0.6538461:
## :...bmi <= 0.3185484: 0 (220/5)
##     bmi > 0.3185484:
##     :...avg_glucose_level <= 0.2549074:
##         :...Residence_typeUrban <= 0:
##         :   :...bmi <= 0.3682302: 1 (51/46)
##         :   :   bmi > 0.3682302:
##         :   :   :...avg_glucose_level <= 0.01902914: 1 (5/4)
##         :   :       avg_glucose_level > 0.01902914: 0 (128)
##         :   Residence_typeUrban > 0:
##         :   :...avg_glucose_level <= 0.169461:
##         :       :...avg_glucose_level <= 0.05884255: 0 (29)
##         :       :   avg_glucose_level > 0.05884255: 1 (107/92)
##         :       avg_glucose_level > 0.169461:
##         :       :...bmi <= 0.3205645: 1 (2/1)
##         :           bmi > 0.3205645: 0 (66)
##         avg_glucose_level > 0.2549074:
##         :...smoking_statussmokes > 0: 1 (44/39)
##             smoking_statussmokes <= 0:
##             :...age > 0.2307692: 1 (108/84)
##                 age <= 0.2307692:
##                 :...bmi <= 0.4939516: 1 (62/51)
##                     bmi > 0.4939516: 0 (42)
## 
## 
## Evaluation on training data (1272 cases):
## 
##         Decision Tree       
##    -----------------------  
##    Size      Errors   Cost  
## 
##      13  648(50.9%)   0.58   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     480   643    (a): class 0
##       5   144    (b): class 1
## 
## 
##  Attribute usage:
## 
##  100.00% age
##   67.92% bmi
##   50.63% avg_glucose_level
##   30.50% Residence_typeUrban
##   20.13% smoking_statussmokes
## 
## 
## Time: 0.0 secs
#predict the test data
ageOld_dt_pred <- predict(ageOld_dt, age_old_data_norm[[2]])


#Combined confusion matrix
combinedagepredict <- c(ageOld_dt_pred,ageYoung_dt_pred)-1
summary(combinedagepredict)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2803  1.0000  1.0000
combinedageactual <- c(age_old_data_norm[[2]]$stroke, age_young_data_norm[[2]]$stroke)
summary(combinedageactual)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.05403 0.00000 1.00000
confusionMatrix(as.factor(combinedagepredict), as.factor(combinedageactual))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 897  22
##          1 311  47
##                                           
##                Accuracy : 0.7392          
##                  95% CI : (0.7142, 0.7631)
##     No Information Rate : 0.946           
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1424          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.7425          
##             Specificity : 0.6812          
##          Pos Pred Value : 0.9761          
##          Neg Pred Value : 0.1313          
##              Prevalence : 0.9460          
##          Detection Rate : 0.7024          
##    Detection Prevalence : 0.7197          
##       Balanced Accuracy : 0.7119          
##                                           
##        'Positive' Class : 0               
## 

When we create a Decision Tree model based on an age split, accuracy increases slightly while the kappa statistic decreases slightly, indicating roughly the same performance as the general model. However, it does show that younger patients have different stroke predictors than older patients. For the younger group, age, hypertension, and average glucose level were the significant variables in predicting stroke, while for the older group they were age, BMI, and average glucose level. This could mean that as a patient gets older, BMI becomes a greater indicator of other potential health problems that may also affect stroke risk, or simply a more prominent indicator of overall health and stroke chances.
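The attribute-usage comparison above can also be pulled out programmatically rather than read off the printed summaries. A sketch, assuming the fitted `ageYoung_dt` and `ageOld_dt` objects from this section and the `C50` package's `C5imp()` helper:

```r
library(C50)

# C5imp() returns a data frame with an "Overall" column of usage
# percentages, mirroring the "Attribute usage" block of summary()
usage_young <- C5imp(ageYoung_dt, metric = "usage")
usage_old   <- C5imp(ageOld_dt,   metric = "usage")

# attributes the young-patient tree uses that the old-patient tree does not
setdiff(rownames(usage_young)[usage_young$Overall > 0],
        rownames(usage_old)[usage_old$Overall > 0])
```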

Decision Trees BMI Based Model

# separate data for further insight

#Risky BMI Patients

bmiRisk_dt <- C5.0(as.factor(stroke) ~ ., data = bmi_risk_data_norm[[1]], costs = error_cost)
## Warning: no dimnames were given for the cost matrix; the factor levels will be
## used
plot(bmiRisk_dt)

summary(bmiRisk_dt)
## 
## Call:
## C5.0.formula(formula = as.factor(stroke) ~ ., data =
##  bmi_risk_data_norm[[1]], costs = error_cost)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Tue Apr 27 16:28:05 2021
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 1177 cases (17 attributes) from undefined.data
## Read misclassification costs from undefined.costs
## 
## Decision tree:
## 
## age <= 0.525:
## :...avg_glucose_level <= 0.01288565: 1 (8/7)
## :   avg_glucose_level > 0.01288565: 0 (434/2)
## age > 0.525:
## :...age > 0.8125: 1 (265/215)
##     age <= 0.8125:
##     :...bmi > 0.9640023: 1 (11/7)
##         bmi <= 0.9640023:
##         :...bmi <= 0.111231: 0 (49)
##             bmi > 0.111231:
##             :...hypertension > 0: 1 (39/33)
##                 hypertension <= 0:
##                 :...avg_glucose_level > 0.7869019: 1 (6/4)
##                     avg_glucose_level <= 0.7869019:
##                     :...bmi <= 0.1334898: 1 (8/6)
##                         bmi > 0.1334898:
##                         :...age <= 0.55: 1 (22/19)
##                             age > 0.55:
##                             :...age > 0.75: 1 (73/67)
##                                 age <= 0.75:
##                                 :...work_typeSelf.employed > 0: 0 (51)
##                                     work_typeSelf.employed <= 0:
##                                     :...avg_glucose_level <= 0.2167929: 1 (142/135)
##                                         avg_glucose_level > 0.2167929: 0 (69)
## 
## 
## Evaluation on training data (1177 cases):
## 
##         Decision Tree       
##    -----------------------  
##    Size      Errors   Cost  
## 
##      13  495(42.1%)   0.45   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     601   493    (a): class 0
##       2    81    (b): class 1
## 
## 
##  Attribute usage:
## 
##  100.00% age
##   69.07% avg_glucose_level
##   39.93% bmi
##   34.83% hypertension
##   22.26% work_typeSelf.employed
## 
## 
## Time: 0.0 secs
#predict the test data
bmiRisk_dt_pred <- predict(bmiRisk_dt, bmi_risk_data_norm[[2]])


#Not Risky BMI Patients

bmiNotRisk_dt <- C5.0(as.factor(stroke) ~ ., data = bmi_safe_data_norm[[1]], costs = error_cost)
## Warning: no dimnames were given for the cost matrix; the factor levels will be
## used
plot(bmiNotRisk_dt)

summary(bmiNotRisk_dt)
## 
## Call:
## C5.0.formula(formula = as.factor(stroke) ~ ., data =
##  bmi_safe_data_norm[[1]], costs = error_cost)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Tue Apr 27 16:28:05 2021
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 2655 cases (17 attributes) from undefined.data
## Read misclassification costs from undefined.costs
## 
## Decision tree:
## 
## age <= 0.6337891: 0 (1789/8)
## age > 0.6337891:
## :...age > 0.8413086: 1 (333/271)
##     age <= 0.8413086:
##     :...heart_disease > 0: 1 (47/39)
##         heart_disease <= 0:
##         :...age <= 0.7192383:
##             :...bmi <= 0.1534937: 0 (38)
##             :   bmi > 0.1534937:
##             :   :...age <= 0.6826172:
##             :       :...avg_glucose_level > 0.6068473: 1 (20/17)
##             :       :   avg_glucose_level <= 0.6068473:
##             :       :   :...smoking_statusnever.smoked <= 0: 0 (54)
##             :       :       smoking_statusnever.smoked > 0: 1 (34/32)
##             :       age > 0.6826172:
##             :       :...smoking_statusnever.smoked <= 0: 1 (49/39)
##             :           smoking_statusnever.smoked > 0:
##             :           :...avg_glucose_level <= 0.07277628: 1 (6/4)
##             :               avg_glucose_level > 0.07277628: 0 (32)
##             age > 0.7192383:
##             :...age <= 0.7558594: 0 (100)
##                 age > 0.7558594:
##                 :...ever_marriedYes <= 0: 1 (5/4)
##                     ever_marriedYes > 0:
##                     :...Residence_typeUrban > 0: 0 (64)
##                         Residence_typeUrban <= 0:
##                         :...smoking_statusnever.smoked <= 0: 1 (46/42)
##                             smoking_statusnever.smoked > 0:
##                             :...avg_glucose_level <= 0.8439495: 0 (36)
##                                 avg_glucose_level > 0.8439495: 1 (2/1)
## 
## 
## Evaluation on training data (2655 cases):
## 
##         Decision Tree       
##    -----------------------  
##    Size      Errors   Cost  
## 
##      16  457(17.2%)   0.23   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##    2105   449    (a): class 0
##       8    93    (b): class 1
## 
## 
##  Attribute usage:
## 
##  100.00% age
##   20.08% heart_disease
##    9.76% smoking_statusnever.smoked
##    8.78% bmi
##    6.93% avg_glucose_level
##    5.76% ever_marriedYes
##    5.57% Residence_typeUrban
## 
## 
## Time: 0.0 secs
#predict the test data
bmiNotRisk_dt_pred <- predict(bmiNotRisk_dt, bmi_safe_data_norm[[2]])

#Combined confusion matrix
combinedbmipredict <- c(bmiNotRisk_dt_pred,bmiRisk_dt_pred)-1
summary(combinedbmipredict)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2984  1.0000  1.0000
combinedbmiactual <- c(bmi_safe_data_norm[[2]]$stroke, bmi_risk_data_norm[[2]]$stroke)
summary(combinedbmiactual)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.0509  0.0000  1.0000
confusionMatrix(as.factor(combinedbmipredict), as.factor(combinedbmiactual))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 884  12
##          1 328  53
##                                           
##                Accuracy : 0.7338          
##                  95% CI : (0.7086, 0.7578)
##     No Information Rate : 0.9491          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1651          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.7294          
##             Specificity : 0.8154          
##          Pos Pred Value : 0.9866          
##          Neg Pred Value : 0.1391          
##              Prevalence : 0.9491          
##          Detection Rate : 0.6922          
##    Detection Prevalence : 0.7016          
##       Balanced Accuracy : 0.7724          
##                                           
##        'Positive' Class : 0               
## 

When we create a Decision Tree model based on a BMI split, accuracy increases slightly while the kappa statistic increases by a more meaningful amount, indicating that the BMI split performs somewhat better than the general model. It also shows that patients with risky BMI levels have different stroke predictors than those with less risky BMI levels: for the risky group, age, average glucose level, BMI, and hypertension were the significant variables in predicting stroke, while for the non-risky group they were age, heart disease, and smoking status. Overall, the confusion matrix output leads us to believe that predicting stroke within separate BMI categories is more effective than the age split or the general model. The thresholds for the BMI splits were chosen in similar fashion to the age splits: they were set to optimize kappa, though pushing them further ultimately led to predicting all “stroke” or all “no stroke” within a given BMI category.
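The "set to optimize kappa" step can be made concrete. A self-contained sketch on synthetic data (the `kappa_stat` helper and the toy values are ours, not the report's actual split code): compute kappa for a grid of candidate BMI cut-points and keep the best one.

```r
# Cohen's kappa from 0/1 predictions and actuals
kappa_stat <- function(pred, actual) {
  tab <- table(factor(pred, levels = 0:1), factor(actual, levels = 0:1))
  n  <- sum(tab)
  po <- sum(diag(tab)) / n                      # observed agreement
  pe <- sum(rowSums(tab) * colSums(tab)) / n^2  # agreement expected by chance
  (po - pe) / (1 - pe)
}

# toy data: stroke probability loosely tied to BMI
set.seed(1)
bmi    <- runif(500, 15, 45)
stroke <- rbinom(500, 1, plogis((bmi - 35) / 3) * 0.4)

# treat "bmi above cut" as the positive prediction and scan cut-points
cuts   <- seq(20, 40, by = 0.5)
kappas <- sapply(cuts, function(cut) kappa_stat(as.integer(bmi > cut), stroke))
cuts[which.max(kappas)]  # best cut-point on this toy sample
```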

KNN Model

Next, we want to run a KNN model using max-min normalization to see whether it produces good predictions. The benefits of a KNN model include its accuracy and its fast execution time when predicting whether a person will have a stroke or not.
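For reference, max-min normalization rescales each numeric column to the [0, 1] range; a minimal base-R sketch (the `normalize` helper name is ours — the report's `general_norm` objects were built earlier in the document):

```r
# rescale a numeric vector to the [0, 1] range
normalize <- function(x) (x - min(x)) / (max(x) - min(x))

normalize(c(10, 20, 40))  # 0.0000000 0.3333333 1.0000000
```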

set.seed(12345)
stroketrain_knn <- general_norm[[1]]
stroketrain_knn$stroke = NULL
stroketest_knn <- general_norm[[2]]
stroketest_knn$stroke = NULL

knn_train_labels <- general_norm[[1]]$stroke
knn_test_labels <- general_norm[[2]]$stroke

stroke_knn_test_pred <- knn(train = stroketrain_knn, test = stroketest_knn,
                      cl = knn_train_labels, k= 4)

confusionMatrix(as.factor(stroke_knn_test_pred), as.factor(knn_test_labels))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1198   62
##          1   13    4
##                                           
##                Accuracy : 0.9413          
##                  95% CI : (0.9269, 0.9535)
##     No Information Rate : 0.9483          
##     P-Value [Acc > NIR] : 0.8837          
##                                           
##                   Kappa : 0.0768          
##                                           
##  Mcnemar's Test P-Value : 2.981e-08       
##                                           
##             Sensitivity : 0.98927         
##             Specificity : 0.06061         
##          Pos Pred Value : 0.95079         
##          Neg Pred Value : 0.23529         
##              Prevalence : 0.94832         
##          Detection Rate : 0.93814         
##    Detection Prevalence : 0.98669         
##       Balanced Accuracy : 0.52494         
##                                           
##        'Positive' Class : 0               
## 

Our general KNN model using max-min normalization has a kappa value of 0.0768 and an accuracy of 0.9413. We used a small value of k (4), since larger values diluted the influence of the nearest neighbors and caused the model to always predict 0 (no stroke), because the majority of the data is no-stroke. Next, we want to see whether splitting the data based on age or BMI would increase the accuracy or kappa statistic of our KNN model.
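How k = 4 might be chosen: a sketch that scans several k values and compares kappa, assuming the `stroketrain_knn`/`stroketest_knn` objects defined above (the candidate grid is an assumption):

```r
library(class)
library(caret)

ks <- c(1, 2, 4, 8, 16, 32)
kappas <- sapply(ks, function(k) {
  pred <- knn(train = stroketrain_knn, test = stroketest_knn,
              cl = knn_train_labels, k = k)
  # caret's confusionMatrix() exposes kappa in its $overall vector
  confusionMatrix(as.factor(pred), as.factor(knn_test_labels))$overall["Kappa"]
})
setNames(round(kappas, 4), ks)  # large k tends toward all-0 predictions and kappa near 0
```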

KNN Model Age Based Model

###Young patients
stroketrain_knn_young <- age_young_data_norm[[1]]
stroketrain_knn_young$stroke = NULL
stroketest_knn_young <- age_young_data_norm[[2]]
stroketest_knn_young$stroke = NULL
knn_train_labels_young <- age_young_data_norm[[1]]$stroke
knn_test_labels_young <- age_young_data_norm[[2]]$stroke

stroke_knn_pred_young <- knn(train = stroketrain_knn_young, test = stroketest_knn_young,
                      cl = knn_train_labels_young, k= 4)
confusionMatrix(as.factor(stroke_knn_pred_young), as.factor(knn_test_labels_young))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 845   8
##          1   0   0
##                                           
##                Accuracy : 0.9906          
##                  95% CI : (0.9816, 0.9959)
##     No Information Rate : 0.9906          
##     P-Value [Acc > NIR] : 0.59255         
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : 0.01333         
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.9906          
##          Neg Pred Value :    NaN          
##              Prevalence : 0.9906          
##          Detection Rate : 0.9906          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : 0               
## 
###Old patients
stroketrain_knn_old <- age_old_data_norm[[1]]
stroketrain_knn_old$stroke = NULL
stroketest_knn_old <- age_old_data_norm[[2]]
stroketest_knn_old$stroke = NULL
knn_train_labels_old <- age_old_data_norm[[1]]$stroke
knn_test_labels_old <- age_old_data_norm[[2]]$stroke

stroke_knn_pred_old <- knn(train = stroketrain_knn_old, test = stroketest_knn_old,
                      cl = knn_train_labels_old, k = 4)
confusionMatrix(as.factor(stroke_knn_pred_old), as.factor(knn_test_labels_old))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 348  57
##          1  15   4
##                                          
##                Accuracy : 0.8302         
##                  95% CI : (0.791, 0.8647)
##     No Information Rate : 0.8561         
##     P-Value [Acc > NIR] : 0.9416         
##                                          
##                   Kappa : 0.034          
##                                          
##  Mcnemar's Test P-Value : 1.352e-06      
##                                          
##             Sensitivity : 0.95868        
##             Specificity : 0.06557        
##          Pos Pred Value : 0.85926        
##          Neg Pred Value : 0.21053        
##              Prevalence : 0.85613        
##          Detection Rate : 0.82075        
##    Detection Prevalence : 0.95519        
##       Balanced Accuracy : 0.51213        
##                                          
##        'Positive' Class : 0              
## 
##Combine the results together
knn_age_combined_pred = c(stroke_knn_pred_young, stroke_knn_pred_old) - 1
true_label_age = c(age_young_data_norm[[2]]$stroke, age_old_data_norm[[2]]$stroke)
confusionMatrix(as.factor(knn_age_combined_pred), as.factor(true_label_age))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1193   65
##          1   15    4
##                                         
##                Accuracy : 0.9374        
##                  95% CI : (0.9226, 0.95)
##     No Information Rate : 0.946         
##     P-Value [Acc > NIR] : 0.9202        
##                                         
##                   Kappa : 0.0692        
##                                         
##  Mcnemar's Test P-Value : 4.293e-08     
##                                         
##             Sensitivity : 0.98758       
##             Specificity : 0.05797       
##          Pos Pred Value : 0.94833       
##          Neg Pred Value : 0.21053       
##              Prevalence : 0.94597       
##          Detection Rate : 0.93422       
##    Detection Prevalence : 0.98512       
##       Balanced Accuracy : 0.52278       
##                                         
##        'Positive' Class : 0             
## 

After splitting the data by age, we built two separate models and then combined their predictions. As seen in the confusion matrices, the kappa value for the young-patient KNN model is 0, while it is 0.034 for the old-patient model. This is because only 8 young patients in our test data actually had a stroke, while 61 old patients did. As a result, the KNN model does somewhat better at predicting whether an older patient will have a stroke. The combined results have a kappa value of 0.0692, slightly lower than the original model without the age split, likely because stroke predictions for younger patients are very inaccurate given the small number of stroke observations in that group.

KNN Model BMI Based Model

###Risky patients
stroketrain_knn_risk <- bmi_risk_data_norm[[1]]
stroketrain_knn_risk$stroke = NULL
stroketest_knn_risk <- bmi_risk_data_norm[[2]]
stroketest_knn_risk$stroke = NULL
knn_train_labels_risk <- bmi_risk_data_norm[[1]]$stroke
knn_test_labels_risk <- bmi_risk_data_norm[[2]]$stroke

stroke_knn_pred_risk <- knn(train = stroketrain_knn_risk, test = stroketest_knn_risk,
                      cl = knn_train_labels_risk, k= 4)
confusionMatrix(as.factor(stroke_knn_pred_risk), as.factor(knn_test_labels_risk))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 350  33
##          1   7   2
##                                           
##                Accuracy : 0.898           
##                  95% CI : (0.8636, 0.9261)
##     No Information Rate : 0.9107          
##     P-Value [Acc > NIR] : 0.8355          
##                                           
##                   Kappa : 0.0564          
##                                           
##  Mcnemar's Test P-Value : 7.723e-05       
##                                           
##             Sensitivity : 0.98039         
##             Specificity : 0.05714         
##          Pos Pred Value : 0.91384         
##          Neg Pred Value : 0.22222         
##              Prevalence : 0.91071         
##          Detection Rate : 0.89286         
##    Detection Prevalence : 0.97704         
##       Balanced Accuracy : 0.51877         
##                                           
##        'Positive' Class : 0               
## 
###Safe patients
stroketrain_knn_safe <- bmi_safe_data_norm[[1]]
stroketrain_knn_safe$stroke = NULL
stroketest_knn_safe <- bmi_safe_data_norm[[2]]
stroketest_knn_safe$stroke = NULL
knn_train_labels_safe <- bmi_safe_data_norm[[1]]$stroke
knn_test_labels_safe <- bmi_safe_data_norm[[2]]$stroke

stroke_knn_pred_safe <- knn(train = stroketrain_knn_safe, test = stroketest_knn_safe,
                      cl = knn_train_labels_safe, k = 4)
confusionMatrix(as.factor(stroke_knn_pred_safe), as.factor(knn_test_labels_safe))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 852  29
##          1   3   1
##                                           
##                Accuracy : 0.9638          
##                  95% CI : (0.9493, 0.9751)
##     No Information Rate : 0.9661          
##     P-Value [Acc > NIR] : 0.6868          
##                                           
##                   Kappa : 0.0513          
##                                           
##  Mcnemar's Test P-Value : 9.897e-06       
##                                           
##             Sensitivity : 0.99649         
##             Specificity : 0.03333         
##          Pos Pred Value : 0.96708         
##          Neg Pred Value : 0.25000         
##              Prevalence : 0.96610         
##          Detection Rate : 0.96271         
##    Detection Prevalence : 0.99548         
##       Balanced Accuracy : 0.51491         
##                                           
##        'Positive' Class : 0               
## 
##Combined
knn_bmi_combined_pred = c(stroke_knn_pred_risk, stroke_knn_pred_safe) - 1
true_label_bmi = c(bmi_risk_data_norm[[2]]$stroke, bmi_safe_data_norm[[2]]$stroke)
confusionMatrix(as.factor(knn_bmi_combined_pred), as.factor(true_label_bmi))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1202   62
##          1   10    3
##                                           
##                Accuracy : 0.9436          
##                  95% CI : (0.9295, 0.9556)
##     No Information Rate : 0.9491          
##     P-Value [Acc > NIR] : 0.8307          
##                                           
##                   Kappa : 0.061           
##                                           
##  Mcnemar's Test P-Value : 1.851e-09       
##                                           
##             Sensitivity : 0.99175         
##             Specificity : 0.04615         
##          Pos Pred Value : 0.95095         
##          Neg Pred Value : 0.23077         
##              Prevalence : 0.94910         
##          Detection Rate : 0.94127         
##    Detection Prevalence : 0.98982         
##       Balanced Accuracy : 0.51895         
##                                           
##        'Positive' Class : 0               
## 

After splitting the data by BMI value, we built two separate models and then combined their predictions. As seen in the confusion matrices, the kappa value for the risky-group KNN model (BMI over 26.5 but under 32) is only 0.0564, and it is 0.0513 for the safe group, so our KNN model is marginally more powerful for the risky BMI group. When we combine the two models' results, the combined model has a kappa value of 0.061.

Moving forward, the KNN models have quite low kappa values and may not carry much weight in the final stacked model. Because the dataset is small and strokes make up a small fraction of the observations, KNN does not lend itself well to this problem.
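The class-imbalance point can be quantified with a back-of-the-envelope calculation: if roughly 5% of patients have a stroke and a test point's 4 nearest neighbors behaved like a random draw (a simplification — real neighbors are not random), a stroke-majority vote would be vanishingly rare.

```r
# probability that 3 or 4 of 4 randomly drawn neighbors are stroke cases,
# at ~5% prevalence (a rough stand-in for the minority-class share here)
p <- 0.05
sum(dbinom(3:4, size = 4, prob = p))  # about 0.00048
```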

Logistic Regression Model

We wanted to run a logistic regression model to see whether it would perform better than the other models. A benefit of a logistic regression model is that it highlights which variables are significant (and with what magnitude) in predicting whether a person will have a stroke or not.

#Mass Effect
MassEffect <- glm(stroke ~ ., data = general_norm[[1]], family = "binomial")
summary(MassEffect)
## 
## Call:
## glm(formula = stroke ~ ., family = "binomial", data = general_norm[[1]])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1794  -0.3092  -0.1550  -0.0754   3.4351  
## 
## Coefficients: (1 not defined because of singularities)
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -7.04369    1.04540  -6.738 1.61e-11 ***
## genderFemale                 0.05508    0.16784   0.328 0.742782    
## genderMale                        NA         NA      NA       NA    
## age                          6.18124    0.57718  10.709  < 2e-16 ***
## hypertension                 0.47856    0.19008   2.518 0.011813 *  
## heart_disease                0.36958    0.21575   1.713 0.086716 .  
## ever_marriedYes             -0.06185    0.28130  -0.220 0.825977    
## work_typeGovt_job           -0.93496    1.14011  -0.820 0.412182    
## work_typeNever_worked       -9.94444  417.91595  -0.024 0.981016    
## work_typePrivate            -0.73719    1.12345  -0.656 0.511706    
## work_typeSelf.employed      -1.07852    1.14208  -0.944 0.344993    
## Residence_typeUrban          0.12903    0.16239   0.795 0.426850    
## avg_glucose_level            1.06269    0.29927   3.551 0.000384 ***
## bmi                          0.76322    1.15999   0.658 0.510566    
## smoking_statusnever.smoked  -0.29565    0.20410  -1.449 0.147453    
## smoking_statussmokes         0.09566    0.24329   0.393 0.694183    
## smoking_statusUnknown       -0.15294    0.24652  -0.620 0.535006    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1470.4  on 3831  degrees of freedom
## Residual deviance: 1144.7  on 3816  degrees of freedom
## AIC: 1176.7
## 
## Number of Fisher Scoring iterations: 14
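The coefficient magnitudes above are on the log-odds scale; exponentiating converts them to odds ratios, which are easier to read. A sketch, assuming the `MassEffect` model fitted above — for example, the hypertension estimate of 0.47856 corresponds to exp(0.47856) ≈ 1.61, i.e. about a 61% increase in the odds of stroke, holding the other variables fixed:

```r
# odds ratios for the full logistic model; values > 1 raise the odds of stroke
round(exp(coef(MassEffect)), 3)
```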
#Final
logit_final <- step(MassEffect)
## Start:  AIC=1176.73
## stroke ~ genderFemale + genderMale + age + hypertension + heart_disease + 
##     ever_marriedYes + work_typeGovt_job + work_typeNever_worked + 
##     work_typePrivate + work_typeSelf.employed + Residence_typeUrban + 
##     avg_glucose_level + bmi + smoking_statusnever.smoked + smoking_statussmokes + 
##     smoking_statusUnknown
## 
## 
## Step:  AIC=1176.73
## stroke ~ genderFemale + age + hypertension + heart_disease + 
##     ever_marriedYes + work_typeGovt_job + work_typeNever_worked + 
##     work_typePrivate + work_typeSelf.employed + Residence_typeUrban + 
##     avg_glucose_level + bmi + smoking_statusnever.smoked + smoking_statussmokes + 
##     smoking_statusUnknown
## 
##                              Df Deviance    AIC
## - ever_marriedYes             1   1144.8 1174.8
## - work_typeNever_worked       1   1144.8 1174.8
## - genderFemale                1   1144.8 1174.8
## - smoking_statussmokes        1   1144.9 1174.9
## - work_typePrivate            1   1145.1 1175.1
## - smoking_statusUnknown       1   1145.1 1175.1
## - bmi                         1   1145.2 1175.2
## - work_typeGovt_job           1   1145.3 1175.3
## - Residence_typeUrban         1   1145.4 1175.4
## - work_typeSelf.employed      1   1145.5 1175.5
## <none>                            1144.7 1176.7
## - smoking_statusnever.smoked  1   1146.8 1176.8
## - heart_disease               1   1147.5 1177.5
## - hypertension                1   1150.8 1180.8
## - avg_glucose_level           1   1157.0 1187.0
## - age                         1   1286.6 1316.6
## 
## Step:  AIC=1174.77
## stroke ~ genderFemale + age + hypertension + heart_disease + 
##     work_typeGovt_job + work_typeNever_worked + work_typePrivate + 
##     work_typeSelf.employed + Residence_typeUrban + avg_glucose_level + 
##     bmi + smoking_statusnever.smoked + smoking_statussmokes + 
##     smoking_statusUnknown
## 
##                              Df Deviance    AIC
## - work_typeNever_worked       1   1144.9 1172.9
## - genderFemale                1   1144.9 1172.9
## - smoking_statussmokes        1   1144.9 1172.9
## - smoking_statusUnknown       1   1145.2 1173.2
## - work_typePrivate            1   1145.2 1173.2
## - bmi                         1   1145.2 1173.2
## - work_typeGovt_job           1   1145.4 1173.4
## - Residence_typeUrban         1   1145.4 1173.4
## - work_typeSelf.employed      1   1145.6 1173.6
## <none>                            1144.8 1174.8
## - smoking_statusnever.smoked  1   1146.8 1174.8
## - heart_disease               1   1147.7 1175.7
## - hypertension                1   1150.8 1178.8
## - avg_glucose_level           1   1157.0 1185.0
## - age                         1   1297.9 1325.9
## 
## Step:  AIC=1172.86
## stroke ~ genderFemale + age + hypertension + heart_disease + 
##     work_typeGovt_job + work_typePrivate + work_typeSelf.employed + 
##     Residence_typeUrban + avg_glucose_level + bmi + smoking_statusnever.smoked + 
##     smoking_statussmokes + smoking_statusUnknown
## 
##                              Df Deviance    AIC
## - genderFemale                1   1145.0 1171.0
## - smoking_statussmokes        1   1145.0 1171.0
## - smoking_statusUnknown       1   1145.2 1171.2
## - work_typePrivate            1   1145.2 1171.2
## - bmi                         1   1145.3 1171.3
## - work_typeGovt_job           1   1145.4 1171.4
## - Residence_typeUrban         1   1145.5 1171.5
## - work_typeSelf.employed      1   1145.6 1171.6
## <none>                            1144.9 1172.9
## - smoking_statusnever.smoked  1   1146.9 1172.9
## - heart_disease               1   1147.8 1173.8
## - hypertension                1   1150.9 1176.9
## - avg_glucose_level           1   1157.1 1183.1
## - age                         1   1298.0 1324.0
## 
## Step:  AIC=1170.98
## stroke ~ age + hypertension + heart_disease + work_typeGovt_job + 
##     work_typePrivate + work_typeSelf.employed + Residence_typeUrban + 
##     avg_glucose_level + bmi + smoking_statusnever.smoked + smoking_statussmokes + 
##     smoking_statusUnknown
## 
##                              Df Deviance    AIC
## - smoking_statussmokes        1   1145.1 1169.1
## - smoking_statusUnknown       1   1145.3 1169.3
## - work_typePrivate            1   1145.3 1169.3
## - bmi                         1   1145.4 1169.4
## - work_typeGovt_job           1   1145.5 1169.5
## - Residence_typeUrban         1   1145.6 1169.6
## - work_typeSelf.employed      1   1145.7 1169.7
## - smoking_statusnever.smoked  1   1147.0 1171.0
## <none>                            1145.0 1171.0
## - heart_disease               1   1147.8 1171.8
## - hypertension                1   1151.1 1175.1
## - avg_glucose_level           1   1157.1 1181.1
## - age                         1   1298.1 1322.1
## 
## Step:  AIC=1169.14
## stroke ~ age + hypertension + heart_disease + work_typeGovt_job + 
##     work_typePrivate + work_typeSelf.employed + Residence_typeUrban + 
##     avg_glucose_level + bmi + smoking_statusnever.smoked + smoking_statusUnknown
## 
##                              Df Deviance    AIC
## - work_typePrivate            1   1145.5 1167.5
## - bmi                         1   1145.5 1167.5
## - work_typeGovt_job           1   1145.7 1167.7
## - smoking_statusUnknown       1   1145.8 1167.8
## - Residence_typeUrban         1   1145.8 1167.8
## - work_typeSelf.employed      1   1145.8 1167.8
## <none>                            1145.1 1169.1
## - heart_disease               1   1148.0 1170.0
## - smoking_statusnever.smoked  1   1148.3 1170.3
## - hypertension                1   1151.2 1173.2
## - avg_glucose_level           1   1157.3 1179.3
## - age                         1   1300.6 1322.6
## 
## Step:  AIC=1167.49
## stroke ~ age + hypertension + heart_disease + work_typeGovt_job + 
##     work_typeSelf.employed + Residence_typeUrban + avg_glucose_level + 
##     bmi + smoking_statusnever.smoked + smoking_statusUnknown
## 
##                              Df Deviance    AIC
## - bmi                         1   1145.8 1165.8
## - smoking_statusUnknown       1   1146.0 1166.0
## - work_typeGovt_job           1   1146.2 1166.2
## - Residence_typeUrban         1   1146.2 1166.2
## <none>                            1145.5 1167.5
## - heart_disease               1   1148.4 1168.4
## - work_typeSelf.employed      1   1148.6 1168.6
## - smoking_statusnever.smoked  1   1148.6 1168.6
## - hypertension                1   1151.7 1171.7
## - avg_glucose_level           1   1157.8 1177.8
## - age                         1   1322.5 1342.5
## 
## Step:  AIC=1165.79
## stroke ~ age + hypertension + heart_disease + work_typeGovt_job + 
##     work_typeSelf.employed + Residence_typeUrban + avg_glucose_level + 
##     smoking_statusnever.smoked + smoking_statusUnknown
## 
##                              Df Deviance    AIC
## - smoking_statusUnknown       1   1146.3 1164.3
## - work_typeGovt_job           1   1146.4 1164.4
## - Residence_typeUrban         1   1146.5 1164.5
## <none>                            1145.8 1165.8
## - heart_disease               1   1148.6 1166.6
## - smoking_statusnever.smoked  1   1148.9 1166.9
## - work_typeSelf.employed      1   1148.9 1166.9
## - hypertension                1   1152.3 1170.3
## - avg_glucose_level           1   1159.4 1177.4
## - age                         1   1322.5 1340.5
## 
## Step:  AIC=1164.35
## stroke ~ age + hypertension + heart_disease + work_typeGovt_job + 
##     work_typeSelf.employed + Residence_typeUrban + avg_glucose_level + 
##     smoking_statusnever.smoked
## 
##                              Df Deviance    AIC
## - work_typeGovt_job           1   1147.0 1163.0
## - Residence_typeUrban         1   1147.0 1163.0
## <none>                            1146.3 1164.3
## - smoking_statusnever.smoked  1   1148.9 1164.9
## - heart_disease               1   1149.3 1165.3
## - work_typeSelf.employed      1   1149.5 1165.5
## - hypertension                1   1153.2 1169.2
## - avg_glucose_level           1   1160.3 1176.3
## - age                         1   1329.5 1345.5
## 
## Step:  AIC=1162.98
## stroke ~ age + hypertension + heart_disease + work_typeSelf.employed + 
##     Residence_typeUrban + avg_glucose_level + smoking_statusnever.smoked
## 
##                              Df Deviance    AIC
## - Residence_typeUrban         1   1147.6 1161.6
## <none>                            1147.0 1163.0
## - smoking_statusnever.smoked  1   1149.6 1163.6
## - work_typeSelf.employed      1   1149.7 1163.7
## - heart_disease               1   1150.0 1164.0
## - hypertension                1   1153.9 1167.9
## - avg_glucose_level           1   1161.0 1175.0
## - age                         1   1329.6 1343.6
## 
## Step:  AIC=1161.63
## stroke ~ age + hypertension + heart_disease + work_typeSelf.employed + 
##     avg_glucose_level + smoking_statusnever.smoked
## 
##                              Df Deviance    AIC
## <none>                            1147.6 1161.6
## - smoking_statusnever.smoked  1   1150.4 1162.4
## - work_typeSelf.employed      1   1150.4 1162.4
## - heart_disease               1   1150.6 1162.6
## - hypertension                1   1154.5 1166.5
## - avg_glucose_level           1   1161.8 1173.8
## - age                         1   1330.8 1342.8
summary(logit_final)
## 
## Call:
## glm(formula = stroke ~ age + hypertension + heart_disease + work_typeSelf.employed + 
##     avg_glucose_level + smoking_statusnever.smoked, family = "binomial", 
##     data = general_norm[[1]])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1294  -0.3114  -0.1591  -0.0735   3.5923  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -7.4843     0.4239 -17.654  < 2e-16 ***
## age                          5.9966     0.5274  11.371  < 2e-16 ***
## hypertension                 0.5064     0.1885   2.686 0.007225 ** 
## heart_disease                0.3743     0.2120   1.765 0.077571 .  
## work_typeSelf.employed      -0.3087     0.1882  -1.641 0.100866    
## avg_glucose_level            1.1096     0.2895   3.833 0.000127 ***
## smoking_statusnever.smoked  -0.2817     0.1711  -1.646 0.099671 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1470.4  on 3831  degrees of freedom
## Residual deviance: 1147.6  on 3825  degrees of freedom
## AIC: 1161.6
## 
## Number of Fisher Scoring iterations: 7
#Confusion Matrix (.11 gave the highest kappa statistic)
LogReg_Pred = predict(logit_final, newdata = general_norm[[2]], type = "response")
LogReg_Pred = ifelse(LogReg_Pred>.11, 1, 0)
confusionMatrix(data = as.factor(LogReg_Pred), reference = as.factor(general_norm[[2]]$stroke))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1050   34
##          1  161   32
##                                           
##                Accuracy : 0.8473          
##                  95% CI : (0.8264, 0.8666)
##     No Information Rate : 0.9483          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1843          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.8671          
##             Specificity : 0.4848          
##          Pos Pred Value : 0.9686          
##          Neg Pred Value : 0.1658          
##              Prevalence : 0.9483          
##          Detection Rate : 0.8222          
##    Detection Prevalence : 0.8489          
##       Balanced Accuracy : 0.6760          
##                                           
##        'Positive' Class : 0               
## 

Based on the general model run, we can see that out of all the original variables, significantly fewer play a significant role in predicting stroke: age, hypertension, heart disease, and average glucose level are the strongest predictors, and each of them increases the log-odds of having a stroke. The final model is fairly accurate at predicting stroke, with an accuracy of .8473, but it achieved a kappa statistic of only .1843. This is mostly because the test data contained far more “negative” than “positive” stroke cases, so simply guessing “negative” for every patient would already yield high accuracy. While the model was quite accurate, we would want to reduce the number of false negatives as much as possible, because the last thing we want is to tell a patient they are not at risk of having a stroke, and then have them suffer one because the model’s prediction kept us from starting the remedial process. To keep the model from simply predicting all “negative” or all “positive”, we lowered the classification threshold from .5 to .11, the cutoff point at which we would start warning patients of their potential for a stroke. In other words, instead of warning them when they have a 50% predicted chance, we warn them at an 11% chance; 11% was the cutoff that produced the highest kappa statistic. Next, we wanted to see if the age and BMI splits could produce a better model.
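The .11 cutoff can be found by scanning a grid of candidate thresholds and keeping the one with the highest kappa. A minimal sketch of that search in R, assuming the `logit_final` model and `general_norm` train/test list from above and the `caret` package for `confusionMatrix()`:

```r
library(caret)

# Predicted stroke probabilities on the held-out test set
probs <- predict(logit_final, newdata = general_norm[[2]], type = "response")
truth <- factor(general_norm[[2]]$stroke, levels = c(0, 1))

# Scan candidate cutoffs and record the kappa statistic for each
cutoffs <- seq(0.01, 0.50, by = 0.01)
kappas <- sapply(cutoffs, function(t) {
  # Fix the factor levels so cutoffs that predict only one class still work
  pred <- factor(ifelse(probs > t, 1, 0), levels = c(0, 1))
  confusionMatrix(data = pred, reference = truth)$overall["Kappa"]
})

# Cutoff with the best kappa (.11 on our data)
cutoffs[which.max(kappas)]
```

Fixing `levels = c(0, 1)` on the predicted factor matters: at extreme cutoffs every observation falls in one class, and without explicit levels `confusionMatrix()` would error on the level mismatch.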

Logistic Regression Age-Based Models

#Young
masslogmodel_young <- glm(stroke ~., data = age_young_data_norm[[1]], family = "binomial")
summary(masslogmodel_young)
## 
## Call:
## glm(formula = stroke ~ ., family = "binomial", data = age_young_data_norm[[1]])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.7079  -0.1625  -0.0912  -0.0543   3.6158  
## 
## Coefficients: (1 not defined because of singularities)
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -7.05982    1.27030  -5.558 2.73e-08 ***
## genderFemale                 0.81243    0.42362   1.918  0.05513 .  
## genderMale                        NA         NA      NA       NA    
## age                          4.88849    1.48608   3.290  0.00100 ** 
## hypertension                 1.33678    0.44662   2.993  0.00276 ** 
## heart_disease               -0.21075    1.07698  -0.196  0.84486    
## ever_marriedYes              0.26457    0.66084   0.400  0.68890    
## work_typeGovt_job           -2.26266    1.54951  -1.460  0.14422    
## work_typeNever_worked      -12.44246  974.07648  -0.013  0.98981    
## work_typePrivate            -2.18137    1.46939  -1.485  0.13767    
## work_typeSelf.employed      -1.70753    1.53926  -1.109  0.26729    
## Residence_typeUrban          0.10064    0.37424   0.269  0.78799    
## avg_glucose_level            0.09394    0.86770   0.108  0.91379    
## bmi                          1.45070    1.96657   0.738  0.46071    
## smoking_statusnever.smoked  -0.23963    0.48887  -0.490  0.62402    
## smoking_statussmokes        -0.11735    0.55834  -0.210  0.83354    
## smoking_statusUnknown       -0.65006    0.67718  -0.960  0.33708    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 335.28  on 2559  degrees of freedom
## Residual deviance: 282.09  on 2544  degrees of freedom
## AIC: 314.09
## 
## Number of Fisher Scoring iterations: 16
finallogmodel_young <- step(masslogmodel_young)
## Start:  AIC=314.09
## stroke ~ genderFemale + genderMale + age + hypertension + heart_disease + 
##     ever_marriedYes + work_typeGovt_job + work_typeNever_worked + 
##     work_typePrivate + work_typeSelf.employed + Residence_typeUrban + 
##     avg_glucose_level + bmi + smoking_statusnever.smoked + smoking_statussmokes + 
##     smoking_statusUnknown
## 
## 
## Step:  AIC=314.09
## stroke ~ genderFemale + age + hypertension + heart_disease + 
##     ever_marriedYes + work_typeGovt_job + work_typeNever_worked + 
##     work_typePrivate + work_typeSelf.employed + Residence_typeUrban + 
##     avg_glucose_level + bmi + smoking_statusnever.smoked + smoking_statussmokes + 
##     smoking_statusUnknown
## 
##                              Df Deviance    AIC
## - avg_glucose_level           1   282.10 312.10
## - heart_disease               1   282.13 312.13
## - smoking_statussmokes        1   282.13 312.13
## - Residence_typeUrban         1   282.16 312.16
## - ever_marriedYes             1   282.25 312.25
## - work_typeNever_worked       1   282.27 312.27
## - smoking_statusnever.smoked  1   282.32 312.32
## - bmi                         1   282.60 312.60
## - smoking_statusUnknown       1   283.05 313.05
## - work_typeSelf.employed      1   283.20 313.20
## - work_typeGovt_job           1   283.94 313.94
## - work_typePrivate            1   283.94 313.94
## <none>                            282.08 314.09
## - genderFemale                1   286.16 316.16
## - hypertension                1   289.85 319.85
## - age                         1   294.65 324.65
## 
## Step:  AIC=312.1
## stroke ~ genderFemale + age + hypertension + heart_disease + 
##     ever_marriedYes + work_typeGovt_job + work_typeNever_worked + 
##     work_typePrivate + work_typeSelf.employed + Residence_typeUrban + 
##     bmi + smoking_statusnever.smoked + smoking_statussmokes + 
##     smoking_statusUnknown
## 
##                              Df Deviance    AIC
## - heart_disease               1   282.14 310.14
## - smoking_statussmokes        1   282.14 310.14
## - Residence_typeUrban         1   282.17 310.17
## - ever_marriedYes             1   282.27 310.27
## - work_typeNever_worked       1   282.28 310.28
## - smoking_statusnever.smoked  1   282.33 310.33
## - bmi                         1   282.64 310.64
## - smoking_statusUnknown       1   283.07 311.07
## - work_typeSelf.employed      1   283.23 311.23
## - work_typeGovt_job           1   283.98 311.98
## - work_typePrivate            1   283.98 311.98
## <none>                            282.10 312.10
## - genderFemale                1   286.17 314.16
## - hypertension                1   290.21 318.21
## - age                         1   294.93 322.93
## 
## Step:  AIC=310.14
## stroke ~ genderFemale + age + hypertension + ever_marriedYes + 
##     work_typeGovt_job + work_typeNever_worked + work_typePrivate + 
##     work_typeSelf.employed + Residence_typeUrban + bmi + smoking_statusnever.smoked + 
##     smoking_statussmokes + smoking_statusUnknown
## 
##                              Df Deviance    AIC
## - smoking_statussmokes        1   282.19 308.19
## - Residence_typeUrban         1   282.20 308.20
## - work_typeNever_worked       1   282.32 308.32
## - ever_marriedYes             1   282.32 308.32
## - smoking_statusnever.smoked  1   282.38 308.38
## - bmi                         1   282.67 308.67
## - smoking_statusUnknown       1   283.11 309.11
## - work_typeSelf.employed      1   283.25 309.25
## - work_typePrivate            1   284.00 310.00
## - work_typeGovt_job           1   284.00 310.00
## <none>                            282.14 310.14
## - genderFemale                1   286.24 312.24
## - hypertension                1   290.23 316.23
## - age                         1   295.04 321.04
## 
## Step:  AIC=308.19
## stroke ~ genderFemale + age + hypertension + ever_marriedYes + 
##     work_typeGovt_job + work_typeNever_worked + work_typePrivate + 
##     work_typeSelf.employed + Residence_typeUrban + bmi + smoking_statusnever.smoked + 
##     smoking_statusUnknown
## 
##                              Df Deviance    AIC
## - Residence_typeUrban         1   282.26 306.26
## - ever_marriedYes             1   282.37 306.37
## - work_typeNever_worked       1   282.38 306.38
## - smoking_statusnever.smoked  1   282.38 306.38
## - bmi                         1   282.73 306.73
## - smoking_statusUnknown       1   283.18 307.18
## - work_typeSelf.employed      1   283.32 307.32
## - work_typePrivate            1   284.09 308.09
## - work_typeGovt_job           1   284.09 308.09
## <none>                            282.19 308.19
## - genderFemale                1   286.27 310.27
## - hypertension                1   290.24 314.24
## - age                         1   295.42 319.42
## 
## Step:  AIC=306.26
## stroke ~ genderFemale + age + hypertension + ever_marriedYes + 
##     work_typeGovt_job + work_typeNever_worked + work_typePrivate + 
##     work_typeSelf.employed + bmi + smoking_statusnever.smoked + 
##     smoking_statusUnknown
## 
##                              Df Deviance    AIC
## - work_typeNever_worked       1   282.44 304.44
## - ever_marriedYes             1   282.44 304.45
## - smoking_statusnever.smoked  1   282.46 304.46
## - bmi                         1   282.79 304.79
## - smoking_statusUnknown       1   283.25 305.25
## - work_typeSelf.employed      1   283.41 305.41
## - work_typePrivate            1   284.17 306.17
## - work_typeGovt_job           1   284.17 306.17
## <none>                            282.26 306.26
## - genderFemale                1   286.41 308.41
## - hypertension                1   290.28 312.28
## - age                         1   295.57 317.57
## 
## Step:  AIC=304.44
## stroke ~ genderFemale + age + hypertension + ever_marriedYes + 
##     work_typeGovt_job + work_typePrivate + work_typeSelf.employed + 
##     bmi + smoking_statusnever.smoked + smoking_statusUnknown
## 
##                              Df Deviance    AIC
## - ever_marriedYes             1   282.63 302.63
## - smoking_statusnever.smoked  1   282.66 302.66
## - bmi                         1   282.96 302.96
## - smoking_statusUnknown       1   283.39 303.39
## - work_typeSelf.employed      1   283.47 303.47
## - work_typePrivate            1   284.20 304.20
## - work_typeGovt_job           1   284.21 304.21
## <none>                            282.44 304.44
## - genderFemale                1   286.59 306.59
## - hypertension                1   290.50 310.50
## - age                         1   295.66 315.66
## 
## Step:  AIC=302.63
## stroke ~ genderFemale + age + hypertension + work_typeGovt_job + 
##     work_typePrivate + work_typeSelf.employed + bmi + smoking_statusnever.smoked + 
##     smoking_statusUnknown
## 
##                              Df Deviance    AIC
## - smoking_statusnever.smoked  1   282.84 300.84
## - bmi                         1   283.22 301.22
## - work_typeSelf.employed      1   283.58 301.58
## - smoking_statusUnknown       1   283.59 301.59
## - work_typeGovt_job           1   284.30 302.30
## - work_typePrivate            1   284.30 302.30
## <none>                            282.63 302.63
## - genderFemale                1   286.78 304.78
## - hypertension                1   290.61 308.61
## - age                         1   301.15 319.15
## 
## Step:  AIC=300.84
## stroke ~ genderFemale + age + hypertension + work_typeGovt_job + 
##     work_typePrivate + work_typeSelf.employed + bmi + smoking_statusUnknown
## 
##                          Df Deviance    AIC
## - bmi                     1   283.41 299.41
## - smoking_statusUnknown   1   283.60 299.60
## - work_typeSelf.employed  1   283.81 299.80
## - work_typePrivate        1   284.51 300.51
## - work_typeGovt_job       1   284.51 300.51
## <none>                        282.84 300.84
## - genderFemale            1   286.89 302.89
## - hypertension            1   290.81 306.81
## - age                     1   301.87 317.87
## 
## Step:  AIC=299.41
## stroke ~ genderFemale + age + hypertension + work_typeGovt_job + 
##     work_typePrivate + work_typeSelf.employed + smoking_statusUnknown
## 
##                          Df Deviance    AIC
## - smoking_statusUnknown   1   284.16 298.16
## - work_typeSelf.employed  1   284.19 298.19
## - work_typePrivate        1   284.87 298.87
## - work_typeGovt_job       1   284.89 298.89
## <none>                        283.41 299.41
## - genderFemale            1   287.28 301.28
## - hypertension            1   292.27 306.27
## - age                     1   302.69 316.69
## 
## Step:  AIC=298.16
## stroke ~ genderFemale + age + hypertension + work_typeGovt_job + 
##     work_typePrivate + work_typeSelf.employed
## 
##                          Df Deviance    AIC
## - work_typeSelf.employed  1   284.71 296.71
## - work_typePrivate        1   285.33 297.33
## - work_typeGovt_job       1   285.34 297.34
## <none>                        284.16 298.16
## - genderFemale            1   288.09 300.09
## - hypertension            1   293.70 305.70
## - age                     1   304.23 316.23
## 
## Step:  AIC=296.71
## stroke ~ genderFemale + age + hypertension + work_typeGovt_job + 
##     work_typePrivate
## 
##                     Df Deviance    AIC
## - work_typeGovt_job  1   285.96 295.96
## - work_typePrivate   1   286.34 296.34
## <none>                   284.71 296.71
## - genderFemale       1   288.52 298.52
## - hypertension       1   294.35 304.35
## - age                1   311.48 321.48
## 
## Step:  AIC=295.96
## stroke ~ genderFemale + age + hypertension + work_typePrivate
## 
##                    Df Deviance    AIC
## - work_typePrivate  1   286.53 294.53
## <none>                  285.96 295.96
## - genderFemale      1   289.67 297.67
## - hypertension      1   295.68 303.68
## - age               1   311.51 319.51
## 
## Step:  AIC=294.53
## stroke ~ genderFemale + age + hypertension
## 
##                Df Deviance    AIC
## <none>              286.53 294.53
## - genderFemale  1   290.24 296.24
## - hypertension  1   296.41 302.41
## - age           1   311.63 317.63
summary(finallogmodel_young)
## 
## Call:
## glm(formula = stroke ~ genderFemale + age + hypertension, family = "binomial", 
##     data = age_young_data_norm[[1]])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.6027  -0.1741  -0.1014  -0.0524   3.8905  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -8.4395     0.9930  -8.499  < 2e-16 ***
## genderFemale   0.7684     0.4194   1.832 0.066909 .  
## age            4.5920     1.1193   4.103 4.09e-05 ***
## hypertension   1.4656     0.4235   3.461 0.000538 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 335.28  on 2559  degrees of freedom
## Residual deviance: 286.53  on 2556  degrees of freedom
## AIC: 294.53
## 
## Number of Fisher Scoring iterations: 9
young_Logpred <- predict(finallogmodel_young, age_young_data_norm[[2]], type= "response")
young_Logpred = ifelse(young_Logpred > .0002, 1, 0) #probabilities above the cutoff are classified as stroke

#Old
masslogmodel_old <- glm(stroke ~., data = age_old_data_norm[[1]], family = "binomial")
summary(masslogmodel_old)
## 
## Call:
## glm(formula = stroke ~ ., family = "binomial", data = age_old_data_norm[[1]])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1062  -0.5462  -0.4091  -0.3190   2.6960  
## 
## Coefficients: (3 not defined because of singularities)
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                -3.43242    0.58448  -5.873 4.29e-09 ***
## genderFemale               -0.32418    0.18368  -1.765  0.07758 .  
## genderMale                       NA         NA      NA       NA    
## age                         1.83062    0.32268   5.673 1.40e-08 ***
## hypertension                0.22121    0.21240   1.042  0.29764    
## heart_disease               0.11018    0.23752   0.464  0.64273    
## ever_marriedYes            -0.19426    0.30809  -0.631  0.52836    
## work_typeGovt_job           0.49115    0.30076   1.633  0.10246    
## work_typeNever_worked            NA         NA      NA       NA    
## work_typePrivate            0.53097    0.21397   2.482  0.01308 *  
## work_typeSelf.employed           NA         NA      NA       NA    
## Residence_typeUrban         0.16568    0.18071   0.917  0.35925    
## avg_glucose_level           1.02001    0.33104   3.081  0.00206 ** 
## bmi                        -0.05227    0.80485  -0.065  0.94822    
## smoking_statusnever.smoked -0.11948    0.22124  -0.540  0.58915    
## smoking_statussmokes       -0.14800    0.29738  -0.498  0.61870    
## smoking_statusUnknown       0.02076    0.26433   0.079  0.93739    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 918.85  on 1271  degrees of freedom
## Residual deviance: 858.69  on 1258  degrees of freedom
## AIC: 886.69
## 
## Number of Fisher Scoring iterations: 5
finallogmodel_old <- step(masslogmodel_old)
## Start:  AIC=886.69
## stroke ~ genderFemale + genderMale + age + hypertension + heart_disease + 
##     ever_marriedYes + work_typeGovt_job + work_typeNever_worked + 
##     work_typePrivate + work_typeSelf.employed + Residence_typeUrban + 
##     avg_glucose_level + bmi + smoking_statusnever.smoked + smoking_statussmokes + 
##     smoking_statusUnknown
## 
## 
## Step:  AIC=886.69
## stroke ~ genderFemale + genderMale + age + hypertension + heart_disease + 
##     ever_marriedYes + work_typeGovt_job + work_typeNever_worked + 
##     work_typePrivate + Residence_typeUrban + avg_glucose_level + 
##     bmi + smoking_statusnever.smoked + smoking_statussmokes + 
##     smoking_statusUnknown
## 
## 
## Step:  AIC=886.69
## stroke ~ genderFemale + genderMale + age + hypertension + heart_disease + 
##     ever_marriedYes + work_typeGovt_job + work_typePrivate + 
##     Residence_typeUrban + avg_glucose_level + bmi + smoking_statusnever.smoked + 
##     smoking_statussmokes + smoking_statusUnknown
## 
## 
## Step:  AIC=886.69
## stroke ~ genderFemale + age + hypertension + heart_disease + 
##     ever_marriedYes + work_typeGovt_job + work_typePrivate + 
##     Residence_typeUrban + avg_glucose_level + bmi + smoking_statusnever.smoked + 
##     smoking_statussmokes + smoking_statusUnknown
## 
##                              Df Deviance    AIC
## - bmi                         1   858.69 884.69
## - smoking_statusUnknown       1   858.69 884.69
## - heart_disease               1   858.90 884.90
## - smoking_statussmokes        1   858.94 884.94
## - smoking_statusnever.smoked  1   858.98 884.98
## - ever_marriedYes             1   859.07 885.07
## - Residence_typeUrban         1   859.53 885.53
## - hypertension                1   859.75 885.75
## <none>                            858.69 886.69
## - work_typeGovt_job           1   861.26 887.26
## - genderFemale                1   861.79 887.79
## - work_typePrivate            1   865.12 891.12
## - avg_glucose_level           1   867.93 893.93
## - age                         1   892.97 918.97
## 
## Step:  AIC=884.69
## stroke ~ genderFemale + age + hypertension + heart_disease + 
##     ever_marriedYes + work_typeGovt_job + work_typePrivate + 
##     Residence_typeUrban + avg_glucose_level + smoking_statusnever.smoked + 
##     smoking_statussmokes + smoking_statusUnknown
## 
##                              Df Deviance    AIC
## - smoking_statusUnknown       1   858.70 882.70
## - heart_disease               1   858.90 882.90
## - smoking_statussmokes        1   858.94 882.94
## - smoking_statusnever.smoked  1   858.98 882.98
## - ever_marriedYes             1   859.07 883.07
## - Residence_typeUrban         1   859.54 883.54
## - hypertension                1   859.75 883.75
## <none>                            858.69 884.69
## - work_typeGovt_job           1   861.27 885.27
## - genderFemale                1   861.80 885.80
## - work_typePrivate            1   865.12 889.12
## - avg_glucose_level           1   868.40 892.40
## - age                         1   895.27 919.27
## 
## Step:  AIC=882.7
## stroke ~ genderFemale + age + hypertension + heart_disease + 
##     ever_marriedYes + work_typeGovt_job + work_typePrivate + 
##     Residence_typeUrban + avg_glucose_level + smoking_statusnever.smoked + 
##     smoking_statussmokes
## 
##                              Df Deviance    AIC
## - heart_disease               1   858.91 880.91
## - smoking_statussmokes        1   859.01 881.01
## - ever_marriedYes             1   859.08 881.08
## - smoking_statusnever.smoked  1   859.11 881.11
## - Residence_typeUrban         1   859.54 881.54
## - hypertension                1   859.75 881.75
## <none>                            858.70 882.70
## - work_typeGovt_job           1   861.28 883.28
## - genderFemale                1   861.80 883.80
## - work_typePrivate            1   865.13 887.13
## - avg_glucose_level           1   868.40 890.40
## - age                         1   895.28 917.28
## 
## Step:  AIC=880.91
## stroke ~ genderFemale + age + hypertension + ever_marriedYes + 
##     work_typeGovt_job + work_typePrivate + Residence_typeUrban + 
##     avg_glucose_level + smoking_statusnever.smoked + smoking_statussmokes
## 
##                              Df Deviance    AIC
## - smoking_statussmokes        1   859.19 879.19
## - ever_marriedYes             1   859.32 879.32
## - smoking_statusnever.smoked  1   859.35 879.35
## - Residence_typeUrban         1   859.75 879.75
## - hypertension                1   859.97 879.97
## <none>                            858.91 880.91
## - work_typeGovt_job           1   861.44 881.44
## - genderFemale                1   862.26 882.26
## - work_typePrivate            1   865.37 885.37
## - avg_glucose_level           1   868.94 888.94
## - age                         1   897.39 917.39
## 
## Step:  AIC=879.19
## stroke ~ genderFemale + age + hypertension + ever_marriedYes + 
##     work_typeGovt_job + work_typePrivate + Residence_typeUrban + 
##     avg_glucose_level + smoking_statusnever.smoked
## 
##                              Df Deviance    AIC
## - smoking_statusnever.smoked  1   859.46 877.46
## - ever_marriedYes             1   859.57 877.57
## - Residence_typeUrban         1   860.01 878.01
## - hypertension                1   860.22 878.22
## <none>                            859.19 879.19
## - work_typeGovt_job           1   861.75 879.75
## - genderFemale                1   862.50 880.50
## - work_typePrivate            1   865.68 883.68
## - avg_glucose_level           1   869.23 887.23
## - age                         1   898.53 916.53
## 
## Step:  AIC=877.46
## stroke ~ genderFemale + age + hypertension + ever_marriedYes + 
##     work_typeGovt_job + work_typePrivate + Residence_typeUrban + 
##     avg_glucose_level
## 
##                       Df Deviance    AIC
## - ever_marriedYes      1   859.82 875.82
## - Residence_typeUrban  1   860.35 876.35
## - hypertension         1   860.43 876.43
## <none>                     859.46 877.46
## - work_typeGovt_job    1   861.93 877.93
## - genderFemale         1   863.07 879.07
## - work_typePrivate     1   865.90 881.90
## - avg_glucose_level    1   869.50 885.50
## - age                  1   898.65 914.65
## 
## Step:  AIC=875.82
## stroke ~ genderFemale + age + hypertension + work_typeGovt_job + 
##     work_typePrivate + Residence_typeUrban + avg_glucose_level
## 
##                       Df Deviance    AIC
## - Residence_typeUrban  1   860.75 874.75
## - hypertension         1   860.81 874.81
## <none>                     859.82 875.82
## - work_typeGovt_job    1   862.31 876.31
## - genderFemale         1   863.35 877.35
## - work_typePrivate     1   866.17 880.17
## - avg_glucose_level    1   869.72 883.72
## - age                  1   899.99 913.99
## 
## Step:  AIC=874.75
## stroke ~ genderFemale + age + hypertension + work_typeGovt_job + 
##     work_typePrivate + avg_glucose_level
## 
##                     Df Deviance    AIC
## - hypertension       1   861.74 873.74
## <none>                   860.75 874.75
## - work_typeGovt_job  1   863.29 875.29
## - genderFemale       1   864.39 876.39
## - work_typePrivate   1   867.11 879.11
## - avg_glucose_level  1   870.71 882.71
## - age                1   901.22 913.22
## 
## Step:  AIC=873.74
## stroke ~ genderFemale + age + work_typeGovt_job + work_typePrivate + 
##     avg_glucose_level
## 
##                     Df Deviance    AIC
## <none>                   861.74 873.74
## - work_typeGovt_job  1   864.11 874.11
## - genderFemale       1   865.29 875.29
## - work_typePrivate   1   867.95 877.95
## - avg_glucose_level  1   872.66 882.66
## - age                1   903.06 913.06
summary(finallogmodel_old)
## 
## Call:
## glm(formula = stroke ~ genderFemale + age + work_typeGovt_job + 
##     work_typePrivate + avg_glucose_level, family = "binomial", 
##     data = age_old_data_norm[[1]])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0503  -0.5555  -0.4143  -0.3224   2.7545  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -3.5669     0.3137 -11.369  < 2e-16 ***
## genderFemale       -0.3397     0.1800  -1.888 0.059065 .  
## age                 1.9002     0.3060   6.209 5.32e-10 ***
## work_typeGovt_job   0.4674     0.2986   1.565 0.117511    
## work_typePrivate    0.5196     0.2129   2.440 0.014689 *  
## avg_glucose_level   1.0605     0.3153   3.363 0.000771 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 918.85  on 1271  degrees of freedom
## Residual deviance: 861.74  on 1266  degrees of freedom
## AIC: 873.74
## 
## Number of Fisher Scoring iterations: 5
old_Logpred <- predict(finallogmodel_old, age_old_data_norm[[2]], type = "response")
# Note the inverted label mapping (probability > .45 -> 0, i.e. no stroke);
# this cutoff was chosen deliberately to maximize the kappa statistic
old_Logpred = ifelse(old_Logpred > .45, 0, 1)

#Combine Predictions (The thresholds above gave us the highest kappa statistic)
age_combined_Logpred = c(young_Logpred, old_Logpred)

age_true_label = c(age_young_data_norm[[2]]$stroke, age_old_data_norm[[2]]$stroke)

confusionMatrix(as.factor(age_combined_Logpred), as.factor(age_true_label))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 845   8
##          1 363  61
##                                           
##                Accuracy : 0.7095          
##                  95% CI : (0.6837, 0.7343)
##     No Information Rate : 0.946           
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1704          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.6995          
##             Specificity : 0.8841          
##          Pos Pred Value : 0.9906          
##          Neg Pred Value : 0.1439          
##              Prevalence : 0.9460          
##          Detection Rate : 0.6617          
##    Detection Prevalence : 0.6680          
##       Balanced Accuracy : 0.7918          
##                                           
##        'Positive' Class : 0               
## 

When we fit logistic regression models on an age-based split, accuracy drops a fair amount and the kappa statistic also declines somewhat, indicating a weaker model than the general one. The split does reveal, however, that younger and older patients have different significant predictors of stroke: for the younger group, hypertension and age were the significant variables, while for the older group, age and average glucose level were. So while the split provides additional insight, it ultimately produces a weaker model, possibly because of the overly coarse categorization of age into "old" and "young." It is also important to note the different cutoff thresholds implemented for the two models. For the "young model," any predicted probability above .0002 was classified as no stroke and anything below it as stroke; for the "old model," the cutoff was .45 with the same inverted mapping. Although these thresholds led the "young model" to predict all "no stroke" and the "old model" to predict all "stroke," we chose them because they maximized the kappa statistic without sacrificing too much accuracy. This pattern is consistent with age being, in general, the dominant variable when predicting stroke versus no stroke.
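The kappa-maximizing cutoffs described above were selected by trial and error; a small helper along the following lines could automate that search. This is a hypothetical sketch (the function name `best_kappa_cutoff` and the cutoff grid are our own, not part of the original analysis): it assumes a vector of predicted probabilities and the matching true labels, and reuses caret's `confusionMatrix()` as elsewhere in this report.

```r
library(caret)

# Sweep a grid of candidate cutoffs and return the one whose resulting
# 0/1 classification yields the highest kappa statistic against the truth.
best_kappa_cutoff <- function(probs, truth, cutoffs = seq(0.01, 0.99, by = 0.01)) {
  kappas <- sapply(cutoffs, function(ct) {
    # Standard (non-inverted) mapping here: probability above cutoff -> stroke (1)
    pred <- factor(ifelse(probs > ct, 1, 0), levels = c(0, 1))
    cm <- confusionMatrix(pred, factor(truth, levels = c(0, 1)))
    unname(cm$overall["Kappa"])
  })
  cutoffs[which.max(kappas)]
}
```

The same sweep could be repeated with accuracy as the criterion to confirm that the kappa-optimal cutoff does not sacrifice too much accuracy.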

Logistic Regression BMI-Based Models

#Risk
masslogmodel_risk <- glm(stroke ~., data = bmi_risk_data_norm[[1]], family = "binomial")
summary(masslogmodel_risk)
## 
## Call:
## glm(formula = stroke ~ ., family = "binomial", data = bmi_risk_data_norm[[1]])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0605  -0.4314  -0.2664  -0.1460   2.9256  
## 
## Coefficients: (1 not defined because of singularities)
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                -4.129e+00  1.122e+00  -3.679 0.000234 ***
## genderFemale                4.986e-02  2.430e-01   0.205 0.837396    
## genderMale                         NA         NA      NA       NA    
## age                         5.098e+00  8.325e-01   6.124 9.14e-10 ***
## hypertension                3.825e-03  3.132e-01   0.012 0.990256    
## heart_disease               8.132e-04  3.441e-01   0.002 0.998115    
## ever_marriedYes             4.662e-01  4.940e-01   0.944 0.345279    
## work_typeGovt_job          -3.018e+00  1.274e+00  -2.369 0.017821 *  
## work_typeNever_worked      -1.254e+01  6.488e+02  -0.019 0.984577    
## work_typePrivate           -2.993e+00  1.244e+00  -2.406 0.016121 *  
## work_typeSelf.employed     -3.069e+00  1.272e+00  -2.413 0.015829 *  
## Residence_typeUrban         2.324e-01  2.398e-01   0.969 0.332530    
## avg_glucose_level           1.012e+00  4.641e-01   2.181 0.029188 *  
## bmi                         2.478e-01  4.214e-01   0.588 0.556548    
## smoking_statusnever.smoked -1.855e-01  3.001e-01  -0.618 0.536339    
## smoking_statussmokes        1.795e-01  3.759e-01   0.478 0.632960    
## smoking_statusUnknown      -3.250e-01  3.805e-01  -0.854 0.392999    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 600.22  on 1176  degrees of freedom
## Residual deviance: 512.41  on 1161  degrees of freedom
## AIC: 544.41
## 
## Number of Fisher Scoring iterations: 14
finallogmodel_risk <- step(masslogmodel_risk)
## Start:  AIC=544.41
## stroke ~ genderFemale + genderMale + age + hypertension + heart_disease + 
##     ever_marriedYes + work_typeGovt_job + work_typeNever_worked + 
##     work_typePrivate + work_typeSelf.employed + Residence_typeUrban + 
##     avg_glucose_level + bmi + smoking_statusnever.smoked + smoking_statussmokes + 
##     smoking_statusUnknown
## 
## 
## Step:  AIC=544.41
## stroke ~ genderFemale + age + hypertension + heart_disease + 
##     ever_marriedYes + work_typeGovt_job + work_typeNever_worked + 
##     work_typePrivate + work_typeSelf.employed + Residence_typeUrban + 
##     avg_glucose_level + bmi + smoking_statusnever.smoked + smoking_statussmokes + 
##     smoking_statusUnknown
## 
##                              Df Deviance    AIC
## - heart_disease               1   512.41 542.41
## - hypertension                1   512.41 542.41
## - genderFemale                1   512.45 542.45
## - smoking_statussmokes        1   512.63 542.63
## - bmi                         1   512.75 542.75
## - smoking_statusnever.smoked  1   512.79 542.79
## - work_typeNever_worked       1   512.83 542.83
## - smoking_statusUnknown       1   513.15 543.15
## - Residence_typeUrban         1   513.35 543.35
## - ever_marriedYes             1   513.39 543.39
## <none>                            512.41 544.41
## - work_typeGovt_job           1   516.02 546.02
## - work_typePrivate            1   516.04 546.04
## - work_typeSelf.employed      1   516.12 546.12
## - avg_glucose_level           1   517.00 547.00
## - age                         1   557.03 587.03
## 
## Step:  AIC=542.41
## stroke ~ genderFemale + age + hypertension + ever_marriedYes + 
##     work_typeGovt_job + work_typeNever_worked + work_typePrivate + 
##     work_typeSelf.employed + Residence_typeUrban + avg_glucose_level + 
##     bmi + smoking_statusnever.smoked + smoking_statussmokes + 
##     smoking_statusUnknown
## 
##                              Df Deviance    AIC
## - hypertension                1   512.41 540.41
## - genderFemale                1   512.45 540.45
## - smoking_statussmokes        1   512.63 540.63
## - bmi                         1   512.76 540.76
## - smoking_statusnever.smoked  1   512.79 540.79
## - work_typeNever_worked       1   512.83 540.83
## - smoking_statusUnknown       1   513.15 541.15
## - Residence_typeUrban         1   513.35 541.35
## - ever_marriedYes             1   513.40 541.40
## <none>                            512.41 542.41
## - work_typeGovt_job           1   516.03 544.03
## - work_typePrivate            1   516.04 544.04
## - work_typeSelf.employed      1   516.12 544.12
## - avg_glucose_level           1   517.06 545.06
## - age                         1   559.77 587.77
## 
## Step:  AIC=540.41
## stroke ~ genderFemale + age + ever_marriedYes + work_typeGovt_job + 
##     work_typeNever_worked + work_typePrivate + work_typeSelf.employed + 
##     Residence_typeUrban + avg_glucose_level + bmi + smoking_statusnever.smoked + 
##     smoking_statussmokes + smoking_statusUnknown
## 
##                              Df Deviance    AIC
## - genderFemale                1   512.45 538.45
## - smoking_statussmokes        1   512.64 538.64
## - bmi                         1   512.76 538.76
## - smoking_statusnever.smoked  1   512.79 538.79
## - work_typeNever_worked       1   512.84 538.84
## - smoking_statusUnknown       1   513.16 539.16
## - Residence_typeUrban         1   513.35 539.35
## - ever_marriedYes             1   513.40 539.40
## <none>                            512.41 540.41
## - work_typeGovt_job           1   516.04 542.04
## - work_typePrivate            1   516.05 542.05
## - work_typeSelf.employed      1   516.13 542.13
## - avg_glucose_level           1   517.15 543.15
## - age                         1   562.08 588.08
## 
## Step:  AIC=538.45
## stroke ~ age + ever_marriedYes + work_typeGovt_job + work_typeNever_worked + 
##     work_typePrivate + work_typeSelf.employed + Residence_typeUrban + 
##     avg_glucose_level + bmi + smoking_statusnever.smoked + smoking_statussmokes + 
##     smoking_statusUnknown
## 
##                              Df Deviance    AIC
## - smoking_statussmokes        1   512.68 536.68
## - bmi                         1   512.80 536.80
## - smoking_statusnever.smoked  1   512.82 536.82
## - work_typeNever_worked       1   512.88 536.88
## - smoking_statusUnknown       1   513.22 537.22
## - Residence_typeUrban         1   513.38 537.38
## - ever_marriedYes             1   513.44 537.44
## <none>                            512.45 538.45
## - work_typeGovt_job           1   516.09 540.09
## - work_typePrivate            1   516.10 540.10
## - work_typeSelf.employed      1   516.17 540.17
## - avg_glucose_level           1   517.16 541.16
## - age                         1   562.09 586.09
## 
## Step:  AIC=536.68
## stroke ~ age + ever_marriedYes + work_typeGovt_job + work_typeNever_worked + 
##     work_typePrivate + work_typeSelf.employed + Residence_typeUrban + 
##     avg_glucose_level + bmi + smoking_statusnever.smoked + smoking_statusUnknown
## 
##                              Df Deviance    AIC
## - bmi                         1   513.02 535.02
## - work_typeNever_worked       1   513.11 535.11
## - smoking_statusnever.smoked  1   513.56 535.56
## - ever_marriedYes             1   513.59 535.59
## - Residence_typeUrban         1   513.60 535.60
## - smoking_statusUnknown       1   514.00 536.00
## <none>                            512.68 536.68
## - work_typeGovt_job           1   516.22 538.22
## - work_typePrivate            1   516.24 538.24
## - work_typeSelf.employed      1   516.31 538.31
## - avg_glucose_level           1   517.43 539.43
## - age                         1   562.46 584.46
## 
## Step:  AIC=535.02
## stroke ~ age + ever_marriedYes + work_typeGovt_job + work_typeNever_worked + 
##     work_typePrivate + work_typeSelf.employed + Residence_typeUrban + 
##     avg_glucose_level + smoking_statusnever.smoked + smoking_statusUnknown
## 
##                              Df Deviance    AIC
## - work_typeNever_worked       1   513.45 533.45
## - ever_marriedYes             1   513.91 533.91
## - Residence_typeUrban         1   513.96 533.96
## - smoking_statusnever.smoked  1   513.99 533.99
## - smoking_statusUnknown       1   514.38 534.38
## <none>                            513.02 535.02
## - work_typeGovt_job           1   516.53 536.53
## - work_typePrivate            1   516.54 536.54
## - work_typeSelf.employed      1   516.62 536.62
## - avg_glucose_level           1   518.21 538.21
## - age                         1   562.58 582.58
## 
## Step:  AIC=533.45
## stroke ~ age + ever_marriedYes + work_typeGovt_job + work_typePrivate + 
##     work_typeSelf.employed + Residence_typeUrban + avg_glucose_level + 
##     smoking_statusnever.smoked + smoking_statusUnknown
## 
##                              Df Deviance    AIC
## - ever_marriedYes             1   514.34 532.34
## - Residence_typeUrban         1   514.40 532.40
## - smoking_statusnever.smoked  1   514.45 532.45
## - smoking_statusUnknown       1   514.75 532.75
## <none>                            513.45 533.45
## - work_typeGovt_job           1   516.57 534.57
## - work_typePrivate            1   516.58 534.58
## - work_typeSelf.employed      1   516.66 534.66
## - avg_glucose_level           1   518.67 536.67
## - age                         1   562.91 580.91
## 
## Step:  AIC=532.34
## stroke ~ age + work_typeGovt_job + work_typePrivate + work_typeSelf.employed + 
##     Residence_typeUrban + avg_glucose_level + smoking_statusnever.smoked + 
##     smoking_statusUnknown
## 
##                              Df Deviance    AIC
## - Residence_typeUrban         1   515.23 531.23
## - smoking_statusnever.smoked  1   515.40 531.40
## - smoking_statusUnknown       1   515.73 531.73
## <none>                            514.34 532.34
## - work_typeGovt_job           1   516.99 532.99
## - work_typePrivate            1   517.00 533.00
## - work_typeSelf.employed      1   517.09 533.09
## - avg_glucose_level           1   519.65 535.65
## - age                         1   573.12 589.12
## 
## Step:  AIC=531.23
## stroke ~ age + work_typeGovt_job + work_typePrivate + work_typeSelf.employed + 
##     avg_glucose_level + smoking_statusnever.smoked + smoking_statusUnknown
## 
##                              Df Deviance    AIC
## - smoking_statusnever.smoked  1   516.47 530.47
## - smoking_statusUnknown       1   516.71 530.71
## <none>                            515.23 531.23
## - work_typeGovt_job           1   517.90 531.90
## - work_typePrivate            1   517.92 531.92
## - work_typeSelf.employed      1   518.02 532.02
## - avg_glucose_level           1   520.61 534.61
## - age                         1   574.46 588.46
## 
## Step:  AIC=530.47
## stroke ~ age + work_typeGovt_job + work_typePrivate + work_typeSelf.employed + 
##     avg_glucose_level + smoking_statusUnknown
## 
##                          Df Deviance    AIC
## - smoking_statusUnknown   1   517.18 529.18
## <none>                        516.47 530.47
## - work_typePrivate        1   519.03 531.03
## - work_typeGovt_job       1   519.05 531.05
## - work_typeSelf.employed  1   519.13 531.13
## - avg_glucose_level       1   521.93 533.93
## - age                     1   575.65 587.65
## 
## Step:  AIC=529.18
## stroke ~ age + work_typeGovt_job + work_typePrivate + work_typeSelf.employed + 
##     avg_glucose_level
## 
##                          Df Deviance    AIC
## <none>                        517.18 529.18
## - work_typePrivate        1   519.55 529.55
## - work_typeGovt_job       1   519.58 529.58
## - work_typeSelf.employed  1   519.66 529.66
## - avg_glucose_level       1   522.99 532.99
## - age                     1   576.63 586.63
summary(finallogmodel_risk)
## 
## Call:
## glm(formula = stroke ~ age + work_typeGovt_job + work_typePrivate + 
##     work_typeSelf.employed + avg_glucose_level, family = "binomial", 
##     data = bmi_risk_data_norm[[1]])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9352  -0.4148  -0.2668  -0.1558   3.0153  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             -4.3737     1.0239  -4.272 1.94e-05 ***
## age                      5.1006     0.7575   6.733 1.66e-11 ***
## work_typeGovt_job       -2.2273     1.1706  -1.903   0.0571 .  
## work_typePrivate        -2.1811     1.1378  -1.917   0.0552 .  
## work_typeSelf.employed  -2.2807     1.1762  -1.939   0.0525 .  
## avg_glucose_level        1.1048     0.4488   2.462   0.0138 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 600.22  on 1176  degrees of freedom
## Residual deviance: 517.18  on 1171  degrees of freedom
## AIC: 529.18
## 
## Number of Fisher Scoring iterations: 6
risk_Logpred <- predict(finallogmodel_risk, bmi_risk_data_norm[[2]], type = "response")
# Cutoff of .4 with inverted label mapping (probability > .4 -> 0), chosen to maximize kappa
risk_Logpred = ifelse(risk_Logpred > .4, 0, 1)

#Safe
masslogmodel_safe <- glm(stroke ~., data = bmi_safe_data_norm[[1]], family = "binomial")
summary(masslogmodel_safe)
## 
## Call:
## glm(formula = stroke ~ ., family = "binomial", data = bmi_safe_data_norm[[1]])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1807  -0.2445  -0.1060  -0.0632   3.7037  
## 
## Coefficients: (1 not defined because of singularities)
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -7.06567    1.06990  -6.604    4e-11 ***
## genderFemale                -0.04952    0.23547  -0.210   0.8334    
## genderMale                        NA         NA      NA       NA    
## age                          7.49208    0.80106   9.353   <2e-16 ***
## hypertension                 0.51112    0.25360   2.015   0.0439 *  
## heart_disease                0.34199    0.31082   1.100   0.2712    
## ever_marriedYes             -0.65488    0.31078  -2.107   0.0351 *  
## work_typeGovt_job           -1.76484    1.25268  -1.409   0.1589    
## work_typeNever_worked      -11.18365  662.23547  -0.017   0.9865    
## work_typePrivate            -1.21031    1.21450  -0.997   0.3190    
## work_typeSelf.employed      -1.62029    1.25291  -1.293   0.1959    
## Residence_typeUrban          0.03876    0.22090   0.175   0.8607    
## avg_glucose_level            0.63873    0.40859   1.563   0.1180    
## bmi                          1.31133    1.30318   1.006   0.3143    
## smoking_statusnever.smoked  -0.24356    0.27684  -0.880   0.3790    
## smoking_statussmokes        -0.09124    0.36606  -0.249   0.8032    
## smoking_statusUnknown       -0.07114    0.32323  -0.220   0.8258    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 858.46  on 2654  degrees of freedom
## Residual deviance: 635.79  on 2639  degrees of freedom
## AIC: 667.79
## 
## Number of Fisher Scoring iterations: 15
finallogmodel_safe <- step(masslogmodel_safe)
## Start:  AIC=667.79
## stroke ~ genderFemale + genderMale + age + hypertension + heart_disease + 
##     ever_marriedYes + work_typeGovt_job + work_typeNever_worked + 
##     work_typePrivate + work_typeSelf.employed + Residence_typeUrban + 
##     avg_glucose_level + bmi + smoking_statusnever.smoked + smoking_statussmokes + 
##     smoking_statusUnknown
## 
## 
## Step:  AIC=667.79
## stroke ~ genderFemale + age + hypertension + heart_disease + 
##     ever_marriedYes + work_typeGovt_job + work_typeNever_worked + 
##     work_typePrivate + work_typeSelf.employed + Residence_typeUrban + 
##     avg_glucose_level + bmi + smoking_statusnever.smoked + smoking_statussmokes + 
##     smoking_statusUnknown
## 
##                              Df Deviance    AIC
## - Residence_typeUrban         1   635.82 665.82
## - genderFemale                1   635.83 665.83
## - smoking_statusUnknown       1   635.84 665.84
## - smoking_statussmokes        1   635.85 665.85
## - work_typeNever_worked       1   635.91 665.91
## - smoking_statusnever.smoked  1   636.56 666.56
## - work_typePrivate            1   636.61 666.61
## - bmi                         1   636.79 666.79
## - heart_disease               1   636.95 666.95
## - work_typeSelf.employed      1   637.11 667.11
## - work_typeGovt_job           1   637.32 667.32
## <none>                            635.79 667.79
## - avg_glucose_level           1   638.19 668.19
## - hypertension                1   639.66 669.66
## - ever_marriedYes             1   639.89 669.89
## - age                         1   759.18 789.18
## 
## Step:  AIC=665.82
## stroke ~ genderFemale + age + hypertension + heart_disease + 
##     ever_marriedYes + work_typeGovt_job + work_typeNever_worked + 
##     work_typePrivate + work_typeSelf.employed + avg_glucose_level + 
##     bmi + smoking_statusnever.smoked + smoking_statussmokes + 
##     smoking_statusUnknown
## 
##                              Df Deviance    AIC
## - genderFemale                1   635.86 663.86
## - smoking_statusUnknown       1   635.87 663.87
## - smoking_statussmokes        1   635.88 663.88
## - work_typeNever_worked       1   635.94 663.94
## - smoking_statusnever.smoked  1   636.59 664.59
## - work_typePrivate            1   636.64 664.64
## - bmi                         1   636.83 664.83
## - heart_disease               1   636.98 664.98
## - work_typeSelf.employed      1   637.15 665.15
## - work_typeGovt_job           1   637.36 665.36
## <none>                            635.82 665.82
## - avg_glucose_level           1   638.21 666.21
## - hypertension                1   639.66 667.66
## - ever_marriedYes             1   640.00 668.00
## - age                         1   759.85 787.85
## 
## Step:  AIC=663.86
## stroke ~ age + hypertension + heart_disease + ever_marriedYes + 
##     work_typeGovt_job + work_typeNever_worked + work_typePrivate + 
##     work_typeSelf.employed + avg_glucose_level + bmi + smoking_statusnever.smoked + 
##     smoking_statussmokes + smoking_statusUnknown
## 
##                              Df Deviance    AIC
## - smoking_statusUnknown       1   635.92 661.92
## - smoking_statussmokes        1   635.92 661.92
## - work_typeNever_worked       1   635.98 661.98
## - work_typePrivate            1   636.69 662.69
## - smoking_statusnever.smoked  1   636.72 662.72
## - bmi                         1   636.87 662.87
## - heart_disease               1   637.08 663.08
## - work_typeSelf.employed      1   637.20 663.20
## - work_typeGovt_job           1   637.41 663.41
## <none>                            635.86 663.86
## - avg_glucose_level           1   638.36 664.36
## - hypertension                1   639.68 665.68
## - ever_marriedYes             1   640.00 666.00
## - age                         1   760.35 786.35
## 
## Step:  AIC=661.92
## stroke ~ age + hypertension + heart_disease + ever_marriedYes + 
##     work_typeGovt_job + work_typeNever_worked + work_typePrivate + 
##     work_typeSelf.employed + avg_glucose_level + bmi + smoking_statusnever.smoked + 
##     smoking_statussmokes
## 
##                              Df Deviance    AIC
## - smoking_statussmokes        1   635.95 659.95
## - work_typeNever_worked       1   636.03 660.03
## - work_typePrivate            1   636.72 660.72
## - smoking_statusnever.smoked  1   636.78 660.78
## - bmi                         1   636.93 660.93
## - heart_disease               1   637.12 661.12
## - work_typeSelf.employed      1   637.22 661.22
## - work_typeGovt_job           1   637.43 661.43
## <none>                            635.92 661.92
## - avg_glucose_level           1   638.54 662.54
## - hypertension                1   639.85 663.85
## - ever_marriedYes             1   640.06 664.06
## - age                         1   760.79 784.79
## 
## Step:  AIC=659.95
## stroke ~ age + hypertension + heart_disease + ever_marriedYes + 
##     work_typeGovt_job + work_typeNever_worked + work_typePrivate + 
##     work_typeSelf.employed + avg_glucose_level + bmi + smoking_statusnever.smoked
## 
##                              Df Deviance    AIC
## - work_typeNever_worked       1   636.06 658.06
## - work_typePrivate            1   636.78 658.78
## - smoking_statusnever.smoked  1   636.80 658.80
## - bmi                         1   636.97 658.97
## - heart_disease               1   637.12 659.12
## - work_typeSelf.employed      1   637.29 659.29
## - work_typeGovt_job           1   637.51 659.51
## <none>                            635.95 659.95
## - avg_glucose_level           1   638.58 660.58
## - hypertension                1   639.89 661.89
## - ever_marriedYes             1   640.13 662.13
## - age                         1   764.67 786.67
## 
## Step:  AIC=658.06
## stroke ~ age + hypertension + heart_disease + ever_marriedYes + 
##     work_typeGovt_job + work_typePrivate + work_typeSelf.employed + 
##     avg_glucose_level + bmi + smoking_statusnever.smoked
## 
##                              Df Deviance    AIC
## - work_typePrivate            1   636.83 656.83
## - smoking_statusnever.smoked  1   636.92 656.92
## - bmi                         1   637.07 657.07
## - heart_disease               1   637.24 657.24
## - work_typeSelf.employed      1   637.32 657.32
## - work_typeGovt_job           1   637.53 657.53
## <none>                            636.06 658.06
## - avg_glucose_level           1   638.70 658.70
## - hypertension                1   640.01 660.01
## - ever_marriedYes             1   640.25 660.25
## - age                         1   764.71 784.71
## 
## Step:  AIC=656.83
## stroke ~ age + hypertension + heart_disease + ever_marriedYes + 
##     work_typeGovt_job + work_typeSelf.employed + avg_glucose_level + 
##     bmi + smoking_statusnever.smoked
## 
##                              Df Deviance    AIC
## - bmi                         1   637.56 655.56
## - smoking_statusnever.smoked  1   637.74 655.74
## - heart_disease               1   638.00 656.00
## <none>                            636.83 656.83
## - work_typeSelf.employed      1   639.38 657.38
## - avg_glucose_level           1   639.56 657.56
## - work_typeGovt_job           1   639.57 657.57
## - hypertension                1   640.80 658.80
## - ever_marriedYes             1   641.71 659.71
## - age                         1   778.39 796.39
## 
## Step:  AIC=655.56
## stroke ~ age + hypertension + heart_disease + ever_marriedYes + 
##     work_typeGovt_job + work_typeSelf.employed + avg_glucose_level + 
##     smoking_statusnever.smoked
## 
##                              Df Deviance    AIC
## - smoking_statusnever.smoked  1   638.48 654.48
## - heart_disease               1   638.63 654.63
## <none>                            637.56 655.56
## - work_typeSelf.employed      1   640.10 656.10
## - work_typeGovt_job           1   640.28 656.28
## - avg_glucose_level           1   641.37 657.37
## - hypertension                1   641.84 657.84
## - ever_marriedYes             1   642.37 658.37
## - age                         1   778.44 794.44
## 
## Step:  AIC=654.48
## stroke ~ age + hypertension + heart_disease + ever_marriedYes + 
##     work_typeGovt_job + work_typeSelf.employed + avg_glucose_level
## 
##                          Df Deviance    AIC
## - heart_disease           1   639.79 653.79
## <none>                        638.48 654.48
## - work_typeSelf.employed  1   641.14 655.14
## - work_typeGovt_job       1   641.19 655.19
## - avg_glucose_level       1   642.18 656.18
## - hypertension            1   642.36 656.36
## - ever_marriedYes         1   643.03 657.03
## - age                     1   778.48 792.48
## 
## Step:  AIC=653.79
## stroke ~ age + hypertension + ever_marriedYes + work_typeGovt_job + 
##     work_typeSelf.employed + avg_glucose_level
## 
##                          Df Deviance    AIC
## <none>                        639.79 653.79
## - work_typeGovt_job       1   642.46 654.46
## - work_typeSelf.employed  1   642.59 654.59
## - hypertension            1   643.95 655.95
## - avg_glucose_level       1   644.20 656.20
## - ever_marriedYes         1   644.74 656.74
## - age                     1   792.97 804.97
summary(finallogmodel_safe)
## 
## Call:
## glm(formula = stroke ~ age + hypertension + ever_marriedYes + 
##     work_typeGovt_job + work_typeSelf.employed + avg_glucose_level, 
##     family = "binomial", data = bmi_safe_data_norm[[1]])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1778  -0.2517  -0.1152  -0.0605   3.9181  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             -7.8437     0.5889 -13.320   <2e-16 ***
## age                      7.2238     0.6896  10.476   <2e-16 ***
## hypertension             0.5206     0.2484   2.095   0.0361 *  
## ever_marriedYes         -0.6970     0.3004  -2.320   0.0203 *  
## work_typeGovt_job       -0.5602     0.3608  -1.553   0.1205    
## work_typeSelf.employed  -0.4221     0.2565  -1.646   0.0998 .  
## avg_glucose_level        0.8181     0.3829   2.137   0.0326 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 858.46  on 2654  degrees of freedom
## Residual deviance: 639.79  on 2648  degrees of freedom
## AIC: 653.79
## 
## Number of Fisher Scoring iterations: 8
safe_Logpred <- predict(finallogmodel_safe, bmi_safe_data_norm[[2]], type= "response")
safe_Logpred = ifelse(safe_Logpred>.0004, 0, 1)

#Combine Predictions (The thresholds above gave us the highest kappa statistic)
bmi_combined_Logpred = c(risk_Logpred, safe_Logpred)

bmi_true_label = c(bmi_risk_data_norm[[2]]$stroke, bmi_safe_data_norm[[2]]$stroke)

confusionMatrix(as.factor(bmi_combined_Logpred), as.factor(bmi_true_label))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 855  30
##          1 357  35
##                                           
##                Accuracy : 0.6969          
##                  95% CI : (0.6709, 0.7221)
##     No Information Rate : 0.9491          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0722          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.70545         
##             Specificity : 0.53846         
##          Pos Pred Value : 0.96610         
##          Neg Pred Value : 0.08929         
##              Prevalence : 0.94910         
##          Detection Rate : 0.66954         
##    Detection Prevalence : 0.69303         
##       Balanced Accuracy : 0.62195         
##                                           
##        'Positive' Class : 0               
## 

When we split based on BMI, we see even worse results with the model. The accuracy drops all the way down to 0.6969 while the kappa statistic plummets to 0.0722. We do see a similar pattern, where different variables are significant for predicting "stroke" in the "safe BMI" versus "risky BMI" groups. However, the confusion matrix output leads us to believe that predicting stroke within separate BMI categories is less effective than splitting by age or using a general model. The thresholds for the BMI splits were chosen in similar fashion to the age splits: they were set to optimize kappa, but ultimately led to the model predicting all "stroke" or all "no stroke" depending on the BMI category.
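The kappa-optimizing cutoff search described above can be sketched as a simple grid search over candidate thresholds. This is a minimal illustration, not our exact tuning code; `probs` and `truth` are hypothetical stand-ins for a model's predicted probabilities and the matching test labels:

```r
# Cohen's kappa from 0/1 prediction and truth vectors
kappa_stat <- function(pred, truth) {
  tab <- table(factor(pred, levels = c(0, 1)),
               factor(truth, levels = c(0, 1)))
  n  <- sum(tab)
  po <- sum(diag(tab)) / n                       # observed agreement
  pe <- sum(rowSums(tab) * colSums(tab)) / n^2   # agreement expected by chance
  (po - pe) / (1 - pe)
}

# Hypothetical probabilities and labels standing in for a real model's output
set.seed(42)
probs <- runif(200)
truth <- rbinom(200, 1, 0.05)

# Grid-search the cutoff that maximizes kappa on these predictions
best <- list(cutoff = NA, kappa = -Inf)
for (cutoff in seq(0.001, 0.999, by = 0.001)) {
  k <- kappa_stat(ifelse(probs > cutoff, 1, 0), truth)
  if (!is.na(k) && k > best$kappa) best <- list(cutoff = cutoff, kappa = k)
}
best
```

Because the search only sees one test split, a cutoff chosen this way can easily degenerate into predicting a single class, which is exactly what happened with our BMI splits.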

ANN Model

ANNmodel <- neuralnet(general_norm[[1]]$stroke ~ ., data=general_norm[[1]], hidden = 2, stepmax = 1000000)
plot(ANNmodel, rep = "best")

# obtain model results
ann_results <- compute(ANNmodel, general_norm[[2]])
ann_predicted_stroke <- ann_results$net.result
cutoff = quantile(ann_predicted_stroke, 1-mean(general_norm[[1]]$stroke))
ann_predicted_stroke <- ifelse(ann_predicted_stroke > cutoff, 1, 0)

confusionMatrix(as.factor(ann_predicted_stroke), as.factor(general_norm[[2]]$stroke))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1161   55
##          1   50   11
##                                           
##                Accuracy : 0.9178          
##                  95% CI : (0.9013, 0.9323)
##     No Information Rate : 0.9483          
##     P-Value [Acc > NIR] : 1.0000          
##                                           
##                   Kappa : 0.13            
##                                           
##  Mcnemar's Test P-Value : 0.6963          
##                                           
##             Sensitivity : 0.9587          
##             Specificity : 0.1667          
##          Pos Pred Value : 0.9548          
##          Neg Pred Value : 0.1803          
##              Prevalence : 0.9483          
##          Detection Rate : 0.9092          
##    Detection Prevalence : 0.9522          
##       Balanced Accuracy : 0.5627          
##                                           
##        'Positive' Class : 0               
## 

Our general ANN model does a poor job of predicting stroke for patients. It has a kappa value of 0.13, which is comparable to the other individual models. The overall accuracy of the model is 91.78%, but out of the 66 patients in the test set who did have a stroke, our model predicted only 11 of them correctly. This is worrying, especially with so many false negatives. Next, we want to see whether splitting the data based on age or BMI would increase the ANN model's accuracy or kappa statistic.
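Because the classes are so imbalanced, accuracy alone is misleading here. Recomputing Cohen's kappa by hand from the confusion matrix above shows how little the model improves on chance agreement:

```r
# Rebuild the ANN confusion matrix (rows = prediction, columns = reference)
tab <- matrix(c(1161, 50, 55, 11), nrow = 2,
              dimnames = list(Prediction = c("0", "1"),
                              Reference  = c("0", "1")))

n  <- sum(tab)                                   # 1277 test cases
po <- sum(diag(tab)) / n                         # observed accuracy, ~0.918
pe <- sum(rowSums(tab) * colSums(tab)) / n^2     # chance agreement, ~0.905
kappa <- (po - pe) / (1 - pe)
round(kappa, 2)                                  # ~0.13
```

Since roughly 95% of patients have no stroke, a model that always predicts "no stroke" already agrees with the truth about 90% of the time by chance, so the ANN's 91.78% accuracy translates into only a 0.13 kappa.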

ANN Age Based Model

#Young
ANNyoungmodel <- neuralnet(age_young_data_norm[[1]]$stroke ~ ., data=age_young_data_norm[[1]], hidden = 2)
plot(ANNyoungmodel, rep = "best")

# obtain model results
annageyoung_results <- compute(ANNyoungmodel, age_young_data_norm[[2]])
predicted_strokeyoung <- annageyoung_results$net.result
cutoff = quantile(predicted_strokeyoung, 1-mean(age_young_data_norm[[1]]$stroke))
predicted_strokeyoung <- ifelse(predicted_strokeyoung > cutoff, 1, 0)

# Accuracy, Kappa, etc.
confusionMatrix(as.factor(predicted_strokeyoung), as.factor(age_young_data_norm[[2]]$stroke))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 834   8
##          1  11   0
##                                           
##                Accuracy : 0.9777          
##                  95% CI : (0.9654, 0.9865)
##     No Information Rate : 0.9906          
##     P-Value [Acc > NIR] : 0.9998          
##                                           
##                   Kappa : -0.011          
##                                           
##  Mcnemar's Test P-Value : 0.6464          
##                                           
##             Sensitivity : 0.9870          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.9905          
##          Neg Pred Value : 0.0000          
##              Prevalence : 0.9906          
##          Detection Rate : 0.9777          
##    Detection Prevalence : 0.9871          
##       Balanced Accuracy : 0.4935          
##                                           
##        'Positive' Class : 0               
## 
#Old
ANNoldmodel <- neuralnet(age_old_data_norm[[1]]$stroke ~ ., data=age_old_data_norm[[1]],hidden = 2)
plot(ANNoldmodel, rep = "best")

# obtain model results
annageold_results <- compute(ANNoldmodel, age_old_data_norm[[2]])
predicted_strokeold <- annageold_results$net.result
cutoff = quantile(predicted_strokeold, 1-mean(age_old_data_norm[[1]]$stroke))
predicted_strokeold <- ifelse(predicted_strokeold > cutoff, 1, 0)

# Accuracy, Kappa, etc.
confusionMatrix(as.factor(predicted_strokeold), as.factor(age_old_data_norm[[2]]$stroke))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 324  50
##          1  39  11
##                                           
##                Accuracy : 0.7901          
##                  95% CI : (0.7482, 0.8279)
##     No Information Rate : 0.8561          
##     P-Value [Acc > NIR] : 0.9999          
##                                           
##                   Kappa : 0.0788          
##                                           
##  Mcnemar's Test P-Value : 0.2891          
##                                           
##             Sensitivity : 0.8926          
##             Specificity : 0.1803          
##          Pos Pred Value : 0.8663          
##          Neg Pred Value : 0.2200          
##              Prevalence : 0.8561          
##          Detection Rate : 0.7642          
##    Detection Prevalence : 0.8821          
##       Balanced Accuracy : 0.5364          
##                                           
##        'Positive' Class : 0               
## 
#Combined Model
combined_predAge = c(predicted_strokeold, predicted_strokeyoung)
true_label = c(age_old_data_norm[[2]]$stroke, age_young_data_norm[[2]]$stroke)
confusionMatrix(as.factor(combined_predAge), as.factor(true_label))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1158   58
##          1   50   11
##                                           
##                Accuracy : 0.9154          
##                  95% CI : (0.8988, 0.9301)
##     No Information Rate : 0.946           
##     P-Value [Acc > NIR] : 1.0000          
##                                           
##                   Kappa : 0.1249          
##                                           
##  Mcnemar's Test P-Value : 0.5006          
##                                           
##             Sensitivity : 0.9586          
##             Specificity : 0.1594          
##          Pos Pred Value : 0.9523          
##          Neg Pred Value : 0.1803          
##              Prevalence : 0.9460          
##          Detection Rate : 0.9068          
##    Detection Prevalence : 0.9522          
##       Balanced Accuracy : 0.5590          
##                                           
##        'Positive' Class : 0               
## 

Splitting by age marginally worsened the ANN model: the overall accuracy took a slight dip to 91.54%, and the kappa statistic fell to 0.1249. Now we will see if splitting based on BMI makes any difference in these results.

ANN BMI Based Model

#Risk
ANNriskmodel <- neuralnet(bmi_risk_data_norm[[1]]$stroke ~ ., data=bmi_risk_data_norm[[1]],hidden =2)
plot(ANNriskmodel, rep = "best")

# obtain model results
annbmirisk_results <- compute(ANNriskmodel, bmi_risk_data_norm[[2]])
predicted_strokerisk <- annbmirisk_results$net.result
cutoff = quantile(predicted_strokerisk, 1-mean(bmi_risk_data_norm[[1]]$stroke))
predicted_strokerisk <- ifelse(predicted_strokerisk > cutoff, 1, 0)

# Accuracy, Kappa, etc.
confusionMatrix(as.factor(predicted_strokerisk), as.factor(bmi_risk_data_norm[[2]]$stroke))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 334  30
##          1  23   5
##                                           
##                Accuracy : 0.8648          
##                  95% CI : (0.8269, 0.8971)
##     No Information Rate : 0.9107          
##     P-Value [Acc > NIR] : 0.9990          
##                                           
##                   Kappa : 0.0862          
##                                           
##  Mcnemar's Test P-Value : 0.4098          
##                                           
##             Sensitivity : 0.9356          
##             Specificity : 0.1429          
##          Pos Pred Value : 0.9176          
##          Neg Pred Value : 0.1786          
##              Prevalence : 0.9107          
##          Detection Rate : 0.8520          
##    Detection Prevalence : 0.9286          
##       Balanced Accuracy : 0.5392          
##                                           
##        'Positive' Class : 0               
## 
#Safe
ANNsafemodel <- neuralnet(bmi_safe_data_norm[[1]]$stroke ~ ., data=bmi_safe_data_norm[[1]],hidden =2)
plot(ANNsafemodel, rep = "best")

# obtain model results
annbmisafe_results <- compute(ANNsafemodel, bmi_safe_data_norm[[2]])
predicted_strokesafe <- annbmisafe_results$net.result
cutoff = quantile(predicted_strokesafe, 1-mean(bmi_safe_data_norm[[1]]$stroke))
predicted_strokesafe <- ifelse(predicted_strokesafe > cutoff, 1, 0)

# Accuracy, Kappa, etc.
confusionMatrix(as.factor(predicted_strokesafe), as.factor(bmi_safe_data_norm[[2]]$stroke))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 830  21
##          1  25   9
##                                           
##                Accuracy : 0.948           
##                  95% CI : (0.9313, 0.9617)
##     No Information Rate : 0.9661          
##     P-Value [Acc > NIR] : 0.9979          
##                                           
##                   Kappa : 0.2544          
##                                           
##  Mcnemar's Test P-Value : 0.6583          
##                                           
##             Sensitivity : 0.9708          
##             Specificity : 0.3000          
##          Pos Pred Value : 0.9753          
##          Neg Pred Value : 0.2647          
##              Prevalence : 0.9661          
##          Detection Rate : 0.9379          
##    Detection Prevalence : 0.9616          
##       Balanced Accuracy : 0.6354          
##                                           
##        'Positive' Class : 0               
## 
#Combined Model
combined_predbmi= c(predicted_strokerisk, predicted_strokesafe)
true_label = c(bmi_risk_data_norm[[2]]$stroke, bmi_safe_data_norm[[2]]$stroke)
confusionMatrix(as.factor(combined_predbmi), as.factor(true_label))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1164   51
##          1   48   14
##                                           
##                Accuracy : 0.9225          
##                  95% CI : (0.9064, 0.9365)
##     No Information Rate : 0.9491          
##     P-Value [Acc > NIR] : 1.0000          
##                                           
##                   Kappa : 0.1797          
##                                           
##  Mcnemar's Test P-Value : 0.8407          
##                                           
##             Sensitivity : 0.9604          
##             Specificity : 0.2154          
##          Pos Pred Value : 0.9580          
##          Neg Pred Value : 0.2258          
##              Prevalence : 0.9491          
##          Detection Rate : 0.9115          
##    Detection Prevalence : 0.9514          
##       Balanced Accuracy : 0.5879          
##                                           
##        'Positive' Class : 0               
## 

Unlike the previous age-based model, the BMI-based ANN model slightly increases the accuracy (now 92.25%), and the kappa statistic improves to 0.1797, the highest of all the models so far. These figures still are not promising, given that only 14 strokes were accurately predicted.

Overall, the ANN model achieved its best kappa statistic when splitting by BMI. However, the gain in kappa doesn't necessarily translate to more lives saved, since the false negative rate is still quite high.
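To make the false negative concern concrete, the miss rate can be read straight off the combined BMI-split confusion matrix above:

```r
# From the combined BMI-split ANN matrix: 14 strokes caught, 51 missed
tp <- 14                 # true strokes the model flagged
fn <- 51                 # true strokes the model missed
fnr <- fn / (fn + tp)    # false negative rate among actual stroke patients
round(fnr, 2)            # ~0.78
```

Even the best-performing ANN variant misses roughly 78% of the actual stroke patients in the test set, which is the figure that matters most for the life-saving application.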

Stacked Model

Next, we wanted to run a stacked model in order to see if a model that combines the findings of each individual model can more accurately predict stroke.

stackedmodel = data.frame(decision_tree = stroke_dt_pred, ann = ann_predicted_stroke, knn = stroke_knn_test_pred, logistic_reg = LogReg_Pred, svm = svm_pred, Stroke = general_norm[[2]]$stroke)

str(svm_pred)
##  num [1:1277, 1] 0 0 0 0 0 0 0 0 0 0 ...
set.seed(42)

testrows <- sample(1:nrow(stackedmodel),950)
stackedtrain <- stackedmodel[-testrows, ]
stackedtest <- stackedmodel[testrows, ]

stackedmodel$decision_tree <- as.numeric(stackedmodel$decision_tree)
stackedmodel$knn <- as.numeric(stackedmodel$knn)

str(stackedmodel)
## 'data.frame':    1277 obs. of  6 variables:
##  $ decision_tree: num  1 1 1 2 1 2 2 1 2 1 ...
##  $ ann          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ knn          : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ logistic_reg : num  0 0 0 0 0 0 1 0 1 0 ...
##  $ svm          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Stroke       : num  0 0 0 0 0 0 0 0 0 0 ...
# Decision Tree based on Train and Test Split
#build model

stacked_dt <- C5.0(as.factor(Stroke) ~ ., data = stackedtrain, costs = error_cost)
## Warning: no dimnames were given for the cost matrix; the factor levels will be
## used
plot(stacked_dt)

summary(stacked_dt)
## 
## Call:
## C5.0.formula(formula = as.factor(Stroke) ~ ., data = stackedtrain, costs
##  = error_cost)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Tue Apr 27 16:28:16 2021
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 327 cases (6 attributes) from undefined.data
## Read misclassification costs from undefined.costs
## 
## Decision tree:
## 
## decision_tree = 1: 1 (92/76)
## decision_tree = 0:
## :...logistic_reg <= 0: 0 (231/2)
##     logistic_reg > 0: 1 (4/3)
## 
## 
## Evaluation on training data (327 cases):
## 
##         Decision Tree       
##    -----------------------  
##    Size      Errors   Cost  
## 
##       3   81(24.8%)   0.36   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     229    79    (a): class 0
##       2    17    (b): class 1
## 
## 
##  Attribute usage:
## 
##  100.00% decision_tree
##   71.87% logistic_reg
## 
## 
## Time: 0.0 secs
#predict the test data
stacked_dt_pred <- predict(stacked_dt, stackedtest)

#Evaluate prediction results
confusionMatrix(stacked_dt_pred, as.factor(stackedtest$Stroke))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 634  10
##          1 269  37
##                                           
##                Accuracy : 0.7063          
##                  95% CI : (0.6762, 0.7351)
##     No Information Rate : 0.9505          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1355          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.7021          
##             Specificity : 0.7872          
##          Pos Pred Value : 0.9845          
##          Neg Pred Value : 0.1209          
##              Prevalence : 0.9505          
##          Detection Rate : 0.6674          
##    Detection Prevalence : 0.6779          
##       Balanced Accuracy : 0.7447          
##                                           
##        'Positive' Class : 0               
## 

Based on the stacked model's tree, it seems that the model is making its decisions mainly from the decision tree and logistic regression predictions, with the decision tree carrying most of the weight. However, even with all of the models' predictions entering as variables, the accuracy and kappa statistics of the stacked model (0.7063 and 0.1355) do not outperform the best individual ANN model. The kappa statistic of the stacked model is comparable to the individual performance of the decision tree or SVM models.
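An alternative to learning the combination with a C5.0 tree, as we did, is a simple weighted vote over the base models' predictions. The sketch below is illustrative only: the three prediction vectors are hypothetical stand-ins for base model outputs, and the weights are arbitrary (one could, for example, set them proportional to each model's kappa):

```r
# Hypothetical 0/1 predictions from three base models on five test cases
dt_pred  <- c(1, 0, 1, 0, 0)
ann_pred <- c(1, 0, 0, 0, 1)
log_pred <- c(1, 1, 0, 0, 0)

# Illustrative weights for the three base models
w <- c(dt = 0.14, ann = 0.13, log = 0.07)
score <- w["dt"] * dt_pred + w["ann"] * ann_pred + w["log"] * log_pred

# Predict stroke when the weighted vote clears half the total weight
stacked_pred <- ifelse(score > sum(w) / 2, 1, 0)
stacked_pred
```

A fixed vote like this cannot overfit the training split the way a learned combiner can, though it also cannot discover interactions between the base models' mistakes.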

Stacked Model Age Based Model

stackedmodelage = data.frame(decision_tree = combinedagepredict, ann = combined_predAge, knn = knn_age_combined_pred, logistic_reg = age_combined_Logpred, svm = combined_svmage_pred, Stroke = true_label_age)

str(stackedmodelage)
## 'data.frame':    1277 obs. of  6 variables:
##  $ decision_tree: num  0 0 1 1 1 0 1 1 1 0 ...
##  $ ann          : num  0 0 1 0 0 0 0 1 1 0 ...
##  $ knn          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ logistic_reg : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ svm          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Stroke       : num  0 0 0 0 0 0 0 0 0 0 ...
set.seed(42)

agetestrows <- sample(1:nrow(stackedmodelage),950)
stackedagetrain <- stackedmodelage[-agetestrows, ]
stackedagetest <- stackedmodelage[agetestrows, ]

stackedmodelage$decision_tree <- as.numeric(stackedmodelage$decision_tree)
stackedmodelage$knn <- as.numeric(stackedmodelage$knn)

str(stackedmodelage)
## 'data.frame':    1277 obs. of  6 variables:
##  $ decision_tree: num  0 0 1 1 1 0 1 1 1 0 ...
##  $ ann          : num  0 0 1 0 0 0 0 1 1 0 ...
##  $ knn          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ logistic_reg : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ svm          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Stroke       : num  0 0 0 0 0 0 0 0 0 0 ...
# Decision Tree based on Train and Test Split
#build model

stacked_age_dt <- C5.0(as.factor(Stroke) ~ ., data = stackedagetrain, costs = error_cost)
## Warning: no dimnames were given for the cost matrix; the factor levels will be
## used
plot(stacked_age_dt)

summary(stacked_age_dt)
## 
## Call:
## C5.0.formula(formula = as.factor(Stroke) ~ ., data = stackedagetrain, costs
##  = error_cost)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Tue Apr 27 16:28:16 2021
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 327 cases (6 attributes) from undefined.data
## Read misclassification costs from undefined.costs
## 
## Decision tree:
## 
## logistic_reg <= 0: 0 (210/3)
## logistic_reg > 0: 1 (117/95)
## 
## 
## Evaluation on training data (327 cases):
## 
##         Decision Tree       
##    -----------------------  
##    Size      Errors   Cost  
## 
##       2   98(30.0%)   0.47   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     207    95    (a): class 0
##       3    22    (b): class 1
## 
## 
##  Attribute usage:
## 
##  100.00% logistic_reg
## 
## 
## Time: 0.0 secs
#predict the test data
stacked_age_dt_pred <- predict(stacked_age_dt, stackedagetest)

#Evaluate prediction results
confusionMatrix(stacked_age_dt_pred, as.factor(stackedagetest$Stroke))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 638   5
##          1 268  39
##                                           
##                Accuracy : 0.7126          
##                  95% CI : (0.6827, 0.7412)
##     No Information Rate : 0.9537          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1537          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.7042          
##             Specificity : 0.8864          
##          Pos Pred Value : 0.9922          
##          Neg Pred Value : 0.1270          
##              Prevalence : 0.9537          
##          Detection Rate : 0.6716          
##    Detection Prevalence : 0.6768          
##       Balanced Accuracy : 0.7953          
##                                           
##        'Positive' Class : 0               
## 

We see that the stacked model does not perform well when we split on age. Its accuracy and kappa are 0.7126 and 0.1537 respectively: lower accuracy than, and comparable kappa to, our other models. This is probably because the model over-predicts "stroke", driven by the high error cost we assigned to false negatives, even though that cost matrix is the same one used in our original stacked model. The stacked model is probably not well suited to categorized splits. We also see a high no-information rate from the split model, so something is going on behind the scenes when stacking; it could simply be that our age-split base models did not perform well in general.

Stacked Model BMI Based Model

stackedmodelbmi = data.frame(decision_tree = combinedbmipredict, ann = combined_predbmi, knn = knn_bmi_combined_pred, logistic_reg = bmi_combined_Logpred, svm = combined_svmbmipred, Stroke = true_label_bmi)

str(stackedmodelbmi)
## 'data.frame':    1277 obs. of  6 variables:
##  $ decision_tree: num  1 1 0 0 0 0 1 0 0 1 ...
##  $ ann          : num  0 1 0 0 1 0 0 0 1 0 ...
##  $ knn          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ logistic_reg : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ svm          : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ Stroke       : num  1 0 0 0 1 0 0 0 0 0 ...
set.seed(42)

bmitestrows <- sample(1:nrow(stackedmodelbmi),950)
stackedbmitrain <- stackedmodelbmi[-bmitestrows, ]
stackedbmitest <- stackedmodelbmi[bmitestrows, ]

stackedmodelbmi$decision_tree <- as.numeric(stackedmodelbmi$decision_tree)
stackedmodelbmi$knn <- as.numeric(stackedmodelbmi$knn)

str(stackedmodelbmi)
## 'data.frame':    1277 obs. of  6 variables:
##  $ decision_tree: num  1 1 0 0 0 0 1 0 0 1 ...
##  $ ann          : num  0 1 0 0 1 0 0 0 1 0 ...
##  $ knn          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ logistic_reg : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ svm          : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ Stroke       : num  1 0 0 0 1 0 0 0 0 0 ...
# Decision Tree based on Train and Test Split
#build model

stacked_bmi_dt <- C5.0(as.factor(Stroke) ~ ., data = stackedbmitrain, costs = error_cost)
## Warning: no dimnames were given for the cost matrix; the factor levels will be
## used
plot(stacked_bmi_dt)

summary(stacked_bmi_dt)
## 
## Call:
## C5.0.formula(formula = as.factor(Stroke) ~ ., data = stackedbmitrain, costs
##  = error_cost)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Tue Apr 27 16:28:16 2021
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 327 cases (6 attributes) from undefined.data
## Read misclassification costs from undefined.costs
## 
## Decision tree:
## 
## ann > 0: 1 (22/17)
## ann <= 0:
## :...logistic_reg > 0: 1 (87/79)
##     logistic_reg <= 0:
##     :...svm <= 0: 0 (210/1)
##         svm > 0: 1 (8/6)
## 
## 
## Evaluation on training data (327 cases):
## 
##         Decision Tree       
##    -----------------------  
##    Size      Errors   Cost  
## 
##       4  103(31.5%)   0.37   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     209   102    (a): class 0
##       1    15    (b): class 1
## 
## 
##  Attribute usage:
## 
##  100.00% ann
##   93.27% logistic_reg
##   66.67% svm
## 
## 
## Time: 0.0 secs
#predict the test data
stacked_bmi_dt_pred <- predict(stacked_bmi_dt, stackedbmitest)

#Evaluate prediction results
confusionMatrix(stacked_bmi_dt_pred, as.factor(stackedbmitest$Stroke))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 600  15
##          1 301  34
##                                           
##                Accuracy : 0.6674          
##                  95% CI : (0.6364, 0.6973)
##     No Information Rate : 0.9484          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0957          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.6659          
##             Specificity : 0.6939          
##          Pos Pred Value : 0.9756          
##          Neg Pred Value : 0.1015          
##              Prevalence : 0.9484          
##          Detection Rate : 0.6316          
##    Detection Prevalence : 0.6474          
##       Balanced Accuracy : 0.6799          
##                                           
##        'Positive' Class : 0               
## 

We see a similar result from the BMI stacked model. The kappa statistic is quite low at 0.0957, despite the tree using more of the individual models to reach its final prediction. This is probably due to reasons similar to those outlined above for the age stacked model.

Improved Stacked Model

Next, given the stacked model's weak predictive performance, we wanted to see whether we could improve it.

set.seed(42)

#Adding Customization Packages
ctrl <- trainControl(method = "cv", number = 10,
                     selectionFunction = "oneSE")

grid <- expand.grid(.model = "tree",
                    .trials = c(1,5,10,15,20,30,35),
                    .winnow = "FALSE")

#Tune the model
TunedStack <- train(as.factor(Stroke) ~ ., data = stackedmodel, method = "C5.0",
           metric = "Kappa",
           trControl = ctrl,
           tuneGrid = grid
          )
TunedStack
## C5.0 
## 
## 1277 samples
##    5 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1149, 1149, 1150, 1149, 1149, 1150, ... 
## Resampling results across tuning parameters:
## 
##   trials  Accuracy   Kappa
##    1      0.9483323  0    
##    5      0.9483323  0    
##   10      0.9483323  0    
##   15      0.9483323  0    
##   20      0.9483323  0    
##   30      0.9483323  0    
##   35      0.9483323  0    
## 
## Tuning parameter 'model' was held constant at a value of tree
## Tuning
##  parameter 'winnow' was held constant at a value of FALSE
## Kappa was used to select the optimal model using  the one SE rule.
## The final values used for the model were trials = 1, model = tree and winnow
##  = FALSE.
#Preparing data for predict and confusion matrix
stackedtestmm <- as.data.frame(model.matrix(~.-1,stackedtest))
stackedtest_norm <- as.data.frame(lapply(stackedtestmm, normalize))
stackedtest_norm$decision_tree0 <- NULL
names(stackedtest_norm)[names(stackedtest_norm) == "decision_tree1"] <- "decision_tree"
names(stackedtest_norm)[names(stackedtest_norm) == "knn1"] <- "knn"

#Predict and Confusion Matrix
tunedstack_pred <- predict(TunedStack, stackedtest_norm)
confusionMatrix(tunedstack_pred, as.factor(stackedtest_norm$Stroke))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 903  47
##          1   0   0
##                                           
##                Accuracy : 0.9505          
##                  95% CI : (0.9348, 0.9634)
##     No Information Rate : 0.9505          
##     P-Value [Acc > NIR] : 0.5387          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : 1.949e-11       
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.9505          
##          Neg Pred Value :    NaN          
##              Prevalence : 0.9505          
##          Detection Rate : 0.9505          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : 0               
## 

We see that even after tuning, the same issue persists: the model makes a single uniform prediction. The only difference this time is that it guesses all "no stroke". This change is likely because we did not include the error costs when training this model, so with so few "stroke" cases in the data set, the model defaults to predicting "no stroke" for everyone.
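Relatedly, the warnings earlier ("no dimnames were given for the cost matrix; the factor levels will be used") can be silenced by labeling the cost matrix dimensions explicitly. A sketch of what our `error_cost` matrix could look like, where the 4x penalty on missed strokes is illustrative rather than our exact value:

```r
# Cost matrix for C5.0: rows are the predicted class, columns the actual class.
# Penalize a missed stroke (predict "0" when actual is "1") 4x more than a
# false alarm (predict "1" when actual is "0").
error_cost <- matrix(c(0, 1, 4, 0), nrow = 2,
                     dimnames = list(predicted = c("0", "1"),
                                     actual    = c("0", "1")))
error_cost
```

Naming the dimensions makes the matrix unambiguous to C5.0 and to readers, since the row/column orientation determines which kind of error is penalized.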

Improved Stacked Model Age Based Model

set.seed(42)
TunedStackAge <- train(as.factor(Stroke) ~ ., data = stackedmodelage, method = "C5.0",
           metric = "Kappa",
           trControl = ctrl,
           tuneGrid = grid)
TunedStackAge
## C5.0 
## 
## 1277 samples
##    5 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1150, 1149, 1150, 1149, 1149, 1150, ... 
## Resampling results across tuning parameters:
## 
##   trials  Accuracy   Kappa
##    1      0.9459707  0    
##    5      0.9459707  0    
##   10      0.9459707  0    
##   15      0.9459707  0    
##   20      0.9459707  0    
##   30      0.9459707  0    
##   35      0.9459707  0    
## 
## Tuning parameter 'model' was held constant at a value of tree
## Tuning
##  parameter 'winnow' was held constant at a value of FALSE
## Kappa was used to select the optimal model using  the one SE rule.
## The final values used for the model were trials = 1, model = tree and winnow
##  = FALSE.
#Predict and Confusion Matrix
tunedstackAGE_pred <- predict(TunedStackAge, stackedtest_norm)
confusionMatrix(tunedstackAGE_pred, as.factor(stackedtest_norm$Stroke))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 903  47
##          1   0   0
##                                           
##                Accuracy : 0.9505          
##                  95% CI : (0.9348, 0.9634)
##     No Information Rate : 0.9505          
##     P-Value [Acc > NIR] : 0.5387          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : 1.949e-11       
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.9505          
##          Neg Pred Value :    NaN          
##              Prevalence : 0.9505          
##          Detection Rate : 0.9505          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : 0               
## 

We see that the improved age model has similar results to the general improved stacked model.

Improved BMI-Based Stacked Model

set.seed(42)
TunedStackBMI <- train(as.factor(Stroke) ~ ., data = stackedmodelbmi, method = "C5.0",
           metric = "Kappa",
           trControl = ctrl,
           tuneGrid = grid)
TunedStackBMI
## C5.0 
## 
## 1277 samples
##    5 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1149, 1149, 1150, 1149, 1148, 1150, ... 
## Resampling results across tuning parameters:
## 
##   trials  Accuracy  Kappa
##    1      0.949119  0    
##    5      0.949119  0    
##   10      0.949119  0    
##   15      0.949119  0    
##   20      0.949119  0    
##   30      0.949119  0    
##   35      0.949119  0    
## 
## Tuning parameter 'model' was held constant at a value of tree
## Tuning
##  parameter 'winnow' was held constant at a value of FALSE
## Kappa was used to select the optimal model using  the one SE rule.
## The final values used for the model were trials = 1, model = tree and winnow
##  = FALSE.
#Predict and Confusion Matrix
tunedstackBMI_pred <- predict(TunedStackBMI, stackedtest_norm)
confusionMatrix(tunedstackBMI_pred, as.factor(stackedtest_norm$Stroke))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 903  47
##          1   0   0
##                                           
##                Accuracy : 0.9505          
##                  95% CI : (0.9348, 0.9634)
##     No Information Rate : 0.9505          
##     P-Value [Acc > NIR] : 0.5387          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : 1.949e-11       
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.9505          
##          Neg Pred Value :    NaN          
##              Prevalence : 0.9505          
##          Detection Rate : 0.9505          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : 0               
## 

Again, we see a similar output. Across every tuned model, regardless of the number of boosting trials, the model only ever predicts 0, on both the training and the test splits.

Summary

Overall, this project taught us a lot, but before we get into what it has taught us, let’s talk about the insights we could extract from our analysis. In our intro, we talked about how the analysis from this project could be applied to predicting whether a patient will have a stroke, for family physicians, hospitals, and insurance companies alike. We believe that while some of our models are accurate, it is best to use them as a second opinion. The models are not accurate enough to rely on them fully when making decisions. In a world with vast amounts of information flowing and biases influencing decision making, a second opinion makes all the difference and can be the difference between life and death. It not only adds confidence to the initial prediction (when the two match), but it can also act as a warning when predictions disagree, indicating that a deeper dive and more research might be needed before making a concrete diagnosis. Insurance companies and doctors can cross-reference the predictions of the KNN model and the decision tree with the doctor’s own judgment to ensure, to the best of their abilities, that the right call is being made. The KNN model is purely accuracy based, making it a solid second opinion on whether a patient will have a stroke, while the decision tree incorporates a cost matrix, making it a solid second opinion on whether to diagnose a patient as at risk of stroke.

If we were to run the models again, it would be beneficial to run them on a much larger data set, given how rare “stroke” is in general. We would need to grow the data set until it includes many more “stroke” incidents. The KNN model is the only model we would feel comfortable using as a heavily weighted second opinion, and even it does not include the error cost of false negatives, which is something we would want to incorporate because the cost of a false negative is much more severe than the cost of a false positive. The other models are not useless, though: the logistic model, decision tree, and ANN give us insight into which variables are more indicative of stroke than others, which is a very valuable insight.
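Short of collecting more data, caret can also rebalance the classes during resampling. The sketch below uses the `sampling = "up"` option of `trainControl()`, which duplicates minority-class (“stroke”) rows inside each fold; the `stackedmodel` and `grid` names are assumptions meant to mirror the objects used in the chunks above.

```r
# Sketch (hypothetical object names): up-sample the rare "stroke" class
# inside each cross-validation fold so the tree has something to learn from.
library(caret)

ctrl_up <- trainControl(method = "cv", number = 10, sampling = "up")

set.seed(42)
TunedStackUp <- train(as.factor(Stroke) ~ ., data = stackedmodel,
                      method = "C5.0", metric = "Kappa",
                      trControl = ctrl_up, tuneGrid = grid)
```

Down-sampling (`sampling = "down"`) or SMOTE (`sampling = "smote"`, via the themis/DMwR packages) are alternatives worth comparing; with so few positive cases, up-sampling risks overfitting to the duplicated rows.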

In terms of an AI Canvas, we have a concrete idea of what we are predicting (stroke or not). In terms of judgment, we have an idea of the costs of false negatives and false positives; however, more research and expert opinions from doctors’ offices and/or hospitals would be needed to set a realistic error cost. We know that the action would be to diagnose stroke or not and follow through on the specific procedures that go along with those diagnoses. The outcome would be to follow up and see whether the patient ultimately has a stroke. Input involves collecting each patient’s age, BMI, average glucose level, gender, etc. Training would involve running train and test data through the models above, and feedback would involve feeding the outcomes and newly collected data back into those models.

This project taught us a lot about real-life applications of machine learning and predictive analysis. The main lesson we learned was that data input matters. The biggest obstacle we faced when running our analysis was the quality and quantity of data within our data set. Our data was very structured and clean, with few missing entries per observation; however, the number of observations was lacking. While a smaller number of observations reduces run time for the models, it makes their predictions less effective. For our project, this was the main issue: while there was a good number of observations, it was not nearly enough. There were not enough positive stroke entries to effectively train the models to predict “stroke.” Most of the observations in our data set were “no-stroke,” leading to a heavily skewed data set that made it difficult to predict “stroke” accurately, due simply to the lack of “strokes” in the data. Another lesson was the importance of setting a cutoff for certain models to predict on. For example, the logistic regression and SVM models required a cutoff point to predict “stroke” or “no-stroke” based on the output probability/log-odds from the model.
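For the logistic model, the cutoff choice can be sketched as below. The object names `logit_model` and `test_norm` are assumptions standing in for a fitted `glm()` and the normalized test split, and the 0.15 threshold is purely illustrative of lowering the cutoff below the default 0.5 to catch more true strokes at the price of more false positives.

```r
# Sketch (hypothetical object names): turn logistic probabilities into
# class predictions with a cutoff lower than the default 0.5, since the
# positive class ("stroke") is rare.
probs <- predict(logit_model, test_norm, type = "response")  # P(stroke)
pred  <- ifelse(probs > 0.15, "1", "0")                      # illustrative cutoff
table(Predicted = pred, Actual = test_norm$Stroke)
```

In practice the cutoff would be chosen from an ROC curve or by minimizing expected misclassification cost, rather than picked by hand.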

In the end, our models do a good job of providing additional insight as well as a second opinion when predicting strokes. Hospitals, doctors’ practices, and insurance companies can benefit greatly, not only from the insights provided by the descriptive models (which give a more in-depth picture of the variables that significantly affect the probability of stroke) but also from the predictions of the KNN model cross-referenced with the decision tree output.